Skip to main content
Top
Published in: International Journal of Parallel Programming 5/2015

01-10-2015

Steal Locally, Share Globally

A Strategy for Multiprogramming in the Manycore Era

Authors: Ashkan Tousimojarad, Wim Vanderbauwhede

Published in: International Journal of Parallel Programming | Issue 5/2015

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

In a general-purpose computing system, several parallel applications run simultaneously on the same platform. Even if each application is highly tuned for that specific platform, additional performance issues are arising in such a dynamic environment in which multiple applications compete for the resources. Different scheduling and resource management techniques have been proposed either at operating system or user level to improve the performance of concurrent workloads. In this paper, we propose a task-based strategy called “Steal Locally, Share Globally” implemented in the runtime of our parallel programming model GPRM (Glasgow Parallel Reduction Machine). We have chosen a state-of-the-art manycore parallel machine, the Intel Xeon Phi, to compare GPRM with some well-known parallel programming models, OpenMP, Intel Cilk Plus and Intel TBB, in both single-programming and multiprogramming scenarios. We show that GPRM not only performs well for single workloads, but also outperforms the other models for multiprogramming workloads. There are three considerations regarding our task-based scheme: (i) It is implemented inside the parallel framework, not as a separate layer; (ii) It improves the performance without the need to change the number of threads for each application (iii) It can be further tuned and improved, not only for the GPRM applications, but for other equivalent parallel programming models.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Footnotes
1
TCB is called “subtask list” in the previous GPRM papers.
 
2
These features can be enabled via command-line switches when compiling the GPRM runtime system. It that sense they can be considered as runtime support features.
 
3
We use the noun “steal” (OED “the act of stealing”) rather than “theft”.
 
4
The “sequential tile” is responsible for the sequential tasks, but also contributes to the parallel execution, whenever required.
 
5
For all experiments, results from the benchmarks kernel are considered in the figures (a) and (b), while in the other results taken from the VTune Amplifier, all information from the start of the application, including its initial phase and the CPU time consumed by the shared libraries is taken into account.
 
6
The sequential phase of the MergeSort benchmark with the input size 80M is around 2 s, and the initial phase of the MatMul benchmark with the input size 4096\(\times \)4096 is about half a second.
 
Literature
1.
go back to reference Arora, N.S., Blumofe, R.D., Plaxton, C.G.: Thread scheduling for multiprogrammed multiprocessors. Theory Comput. Syst. 34(2), 115–144 (2001)MathSciNetCrossRefMATH Arora, N.S., Blumofe, R.D., Plaxton, C.G.: Thread scheduling for multiprogrammed multiprocessors. Theory Comput. Syst. 34(2), 115–144 (2001)MathSciNetCrossRefMATH
2.
go back to reference Ayguadé, E., Copty, N., Duran, A., Hoeflinger, J., Lin, Y., Massaioli, F., Teruel, X., Unnikrishnan, P., Zhang, G.: The design of openmp tasks. IEEE Trans. Parallel Distrib. Syst. 20(3), 404–418 (2009)CrossRef Ayguadé, E., Copty, N., Duran, A., Hoeflinger, J., Lin, Y., Massaioli, F., Teruel, X., Unnikrishnan, P., Zhang, G.: The design of openmp tasks. IEEE Trans. Parallel Distrib. Syst. 20(3), 404–418 (2009)CrossRef
3.
go back to reference Blumofe, R.D., Joerg, C.F., Kuszmaul, B.C., Leiserson, C.E., Randall, K.H., Zhou, Y.: Cilk: an efficient multithreaded runtime system. J. Parallel Distrib. Comput. 37(1), 55–69 (1996)CrossRef Blumofe, R.D., Joerg, C.F., Kuszmaul, B.C., Leiserson, C.E., Randall, K.H., Zhou, Y.: Cilk: an efficient multithreaded runtime system. J. Parallel Distrib. Comput. 37(1), 55–69 (1996)CrossRef
5.
go back to reference Clet-Ortega, J., Carribault, P., Pérache, M.: Evaluation of openmp task scheduling algorithms for large numa architectures. In: Euro-Par 2014 Parallel Processing, pp. 596–607. Springer, New York (2014) Clet-Ortega, J., Carribault, P., Pérache, M.: Evaluation of openmp task scheduling algorithms for large numa architectures. In: Euro-Par 2014 Parallel Processing, pp. 596–607. Springer, New York (2014)
6.
go back to reference Duran, A., Corbalán, J., Ayguadé, E.: An adaptive cut-off for task parallelism. In: International Conference for High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. pp. 1–11. IEEE (2008) Duran, A., Corbalán, J., Ayguadé, E.: An adaptive cut-off for task parallelism. In: International Conference for High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. pp. 1–11. IEEE (2008)
7.
go back to reference Duran, A., Teruel, X., Ferrer, R., Martorell, X., Ayguade, E.: Barcelona openmp tasks suite: a set of benchmarks targeting the exploitation of task parallelism in openmp. In: International Conference on Parallel Processing, 2009. ICPP’09. pp. 124–131. IEEE (2009) Duran, A., Teruel, X., Ferrer, R., Martorell, X., Ayguade, E.: Barcelona openmp tasks suite: a set of benchmarks targeting the exploitation of task parallelism in openmp. In: International Conference on Parallel Processing, 2009. ICPP’09. pp. 124–131. IEEE (2009)
8.
go back to reference Emani, M.K., Wang, Z., O’Boyle, M.F.: Smart, adaptive mapping of parallelism in the presence of external workload. In: 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 1–10. IEEE (2013) Emani, M.K., Wang, Z., O’Boyle, M.F.: Smart, adaptive mapping of parallelism in the presence of external workload. In: 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 1–10. IEEE (2013)
9.
go back to reference Eyerman, S., Eeckhout, L.: System-level performance metrics for multiprogram workloads. IEEE Micro 28(3), 42–53 (2008)CrossRef Eyerman, S., Eeckhout, L.: System-level performance metrics for multiprogram workloads. IEEE Micro 28(3), 42–53 (2008)CrossRef
10.
go back to reference Harris, T., Maas, M., Marathe, V.J.: Callisto: co-scheduling parallel runtime systems. In: Proceedings of the 9th European Conference on Computer Systems, p. 24. ACM (2014) Harris, T., Maas, M., Marathe, V.J.: Callisto: co-scheduling parallel runtime systems. In: Proceedings of the 9th European Conference on Computer Systems, p. 24. ACM (2014)
11.
go back to reference Hofmeyr, S., Iancu, C., Blagojević, F.: Load balancing on speed. In: ACM Sigplan Notices, vol. 45, pp. 147–158. ACM (2010) Hofmeyr, S., Iancu, C., Blagojević, F.: Load balancing on speed. In: ACM Sigplan Notices, vol. 45, pp. 147–158. ACM (2010)
12.
go back to reference Jeffers, J., Reinders, J.: Intel Xeon Phi Coprocessor High Performance Programming. Newnes (2013) Jeffers, J., Reinders, J.: Intel Xeon Phi Coprocessor High Performance Programming. Newnes (2013)
13.
go back to reference Kim, W., Voss, M.: Multicore desktop programming with intel threading building blocks. IEEE Softw. 28(1), 23–31 (2011)CrossRef Kim, W., Voss, M.: Multicore desktop programming with intel threading building blocks. IEEE Softw. 28(1), 23–31 (2011)CrossRef
14.
go back to reference Lubin, M., McMillan, S., Kruse, C.G., Del Vento, D., Montuoro, R.: Efficient software development: 4 Whats new in intel parallel studio xe 2013 service pack (2013) Lubin, M., McMillan, S., Kruse, C.G., Del Vento, D., Montuoro, R.: Efficient software development: 4 Whats new in intel parallel studio xe 2013 service pack (2013)
15.
go back to reference McCool, M., Reinders, J., Robison, A.: Structured Parallel Programming: Patterns for Efficient Computation. Elsevier (2012) McCool, M., Reinders, J., Robison, A.: Structured Parallel Programming: Patterns for Efficient Computation. Elsevier (2012)
16.
go back to reference Reinders, J.: Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism. O’Reilly Media, Inc. (2007) Reinders, J.: Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism. O’Reilly Media, Inc. (2007)
17.
go back to reference Sasaki, H., Tanimoto, T., Inoue, K., Nakamura, H.: Scalability-based manycore partitioning. In: Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, pp. 107–116. ACM (2012) Sasaki, H., Tanimoto, T., Inoue, K., Nakamura, H.: Scalability-based manycore partitioning. In: Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, pp. 107–116. ACM (2012)
18.
go back to reference Saule, E., Catalyurek, U.V.: An early evaluation of the scalability of graph algorithms on the intel mic architecture. In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW) pp. 1629–1639. IEEE (2012) Saule, E., Catalyurek, U.V.: An early evaluation of the scalability of graph algorithms on the intel mic architecture. In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW) pp. 1629–1639. IEEE (2012)
19.
go back to reference Sussman, G.J., Jr., G.L.S.: Scheme: an interpreter for extended lambda calculus. In: MEMO 349, MIT AI LAB (1975) Sussman, G.J., Jr., G.L.S.: Scheme: an interpreter for extended lambda calculus. In: MEMO 349, MIT AI LAB (1975)
20.
go back to reference Teruel, X., Martorell, X., Duran, A., Ferrer, R., Ayguadé, E.: Support for openmp tasks in nanos v4. In: Proceedings of the 2007 Conference of the Center for Advanced Studies on Collaborative Research, pp. 256–259. IBM Corp. (2007) Teruel, X., Martorell, X., Duran, A., Ferrer, R., Ayguadé, E.: Support for openmp tasks in nanos v4. In: Proceedings of the 2007 Conference of the Center for Advanced Studies on Collaborative Research, pp. 256–259. IBM Corp. (2007)
21.
go back to reference Tousimojarad, A., Vanderbauwhede, W.: The Glasgow Parallel Reduction Machine: Programming Shared-Memory Many-Core Systems Using Parallel Task Composition. EPTCS 137, 79–94 (2013). doi:10.4204/EPTCS.137.7 Tousimojarad, A., Vanderbauwhede, W.: The Glasgow Parallel Reduction Machine: Programming Shared-Memory Many-Core Systems Using Parallel Task Composition. EPTCS 137, 79–94 (2013). doi:10.​4204/​EPTCS.​137.​7
22.
go back to reference Tousimojarad, A., Vanderbauwhede, W.: Comparison of three popular parallel programming models on the Intel Xeon Phi. In: Euro-Par 2014: Parallel Processing Workshops, pp. 314–325. Springer, New York (2014) Tousimojarad, A., Vanderbauwhede, W.: Comparison of three popular parallel programming models on the Intel Xeon Phi. In: Euro-Par 2014: Parallel Processing Workshops, pp. 314–325. Springer, New York (2014)
23.
go back to reference Tousimojarad, A., Vanderbauwhede, W.: An efficient thread mapping strategy for multiprogramming on manycore processors. In: Parallel Computing: Accelerating Computational Science and Engineering (CSE), Advances in Parallel Computing, vol. 25, pp. 63–71. IOS Press (2014). doi:10.3233/978-1-61499-381-0-63 Tousimojarad, A., Vanderbauwhede, W.: An efficient thread mapping strategy for multiprogramming on manycore processors. In: Parallel Computing: Accelerating Computational Science and Engineering (CSE), Advances in Parallel Computing, vol. 25, pp. 63–71. IOS Press (2014). doi:10.​3233/​978-1-61499-381-0-63
24.
go back to reference Tousimojarad, A., Vanderbauwhede, W.: A parallel task-based approach to linear algebra. In: 2014 IEEE 13th International Symposium on Parallel and Distributed Computing (ISPDC), pp. 59–66. IEEE (2014) Tousimojarad, A., Vanderbauwhede, W.: A parallel task-based approach to linear algebra. In: 2014 IEEE 13th International Symposium on Parallel and Distributed Computing (ISPDC), pp. 59–66. IEEE (2014)
25.
go back to reference Tucker, A.: Efficient Scheduling on Multiprogrammed Shared-memory Multiprocessors. Ph.D. thesis, Stanford University (1994) Tucker, A.: Efficient Scheduling on Multiprogrammed Shared-memory Multiprocessors. Ph.D. thesis, Stanford University (1994)
26.
go back to reference Veen, A.H.: Dataflow machine architecture. ACM Comput. Surv. (CSUR) 18(4), 365–396 (1986)CrossRef Veen, A.H.: Dataflow machine architecture. ACM Comput. Surv. (CSUR) 18(4), 365–396 (1986)CrossRef
27.
go back to reference Yan, J., He, J., Han, W., Chen, W., Zheng, W.: How openmp applications get more benefit from many-core era. In: Beyond Loop Level Parallelism in OpenMP: Accelerators, Tasking and More, pp. 83–95. Springer, New York (2010) Yan, J., He, J., Han, W., Chen, W., Zheng, W.: How openmp applications get more benefit from many-core era. In: Beyond Loop Level Parallelism in OpenMP: Accelerators, Tasking and More, pp. 83–95. Springer, New York (2010)
Metadata
Title
Steal Locally, Share Globally
A Strategy for Multiprogramming in the Manycore Era
Authors
Ashkan Tousimojarad
Wim Vanderbauwhede
Publication date
01-10-2015
Publisher
Springer US
Published in
International Journal of Parallel Programming / Issue 5/2015
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI
https://doi.org/10.1007/s10766-015-0350-0

Other articles of this Issue 5/2015

International Journal of Parallel Programming 5/2015 Go to the issue

Premium Partner