
2018 | OriginalPaper | Chapter

OpenMP Loop Scheduling Revisited: Making a Case for More Schedules

Authors : Florina M. Ciorba, Christian Iwainsky, Patrick Buder

Published in: Evolving OpenMP for Evolving Architectures

Publisher: Springer International Publishing


Abstract

In light of continued advances in loop scheduling, this work revisits OpenMP loop scheduling by outlining the current state of the art and presenting evidence that the existing OpenMP schedules are insufficient for all combinations of applications, systems, and their characteristics. A review of the state of the art shows that, due to the specifics of parallel applications, the variety of computing platforms, and the numerous performance degradation factors, no single loop scheduling technique can be a ‘one-fits-all’ solution that effectively optimizes the performance of all parallel applications in all situations. Irregularity in computational workloads and hardware systems, including operating system noise, degrades the performance of parallel applications, yet has often been neglected in loop scheduling research, in particular in the context of OpenMP schedules. Existing dynamic loop self-scheduling techniques, such as trapezoid self-scheduling, factoring, and weighted factoring, offer an unexplored potential to alleviate this degradation in OpenMP, because they explicitly target the minimization of load imbalance and scheduling overhead. Through theoretical and experimental evaluation, this work shows that these loop self-scheduling methods provide a benefit in the context of OpenMP. In conclusion, OpenMP must include more schedules to offer broader performance coverage of applications executing on an increasing variety of heterogeneous shared memory computing platforms.
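To make the self-scheduling techniques named above concrete, the following is a minimal sketch of how two of them partition a loop of N iterations among P threads: factoring (FAC), which hands out batches of P equal chunks covering roughly half of the remaining iterations, and trapezoid self-scheduling (TSS), whose chunk sizes decrease linearly from a first chunk f to a last chunk l. The parameter choices (f = ⌈N/2P⌉, l = 1) follow common conventions from the literature; this is an illustration of the chunk-size progressions, not the authors' implementation.

```python
import math

def fac_chunks(N, P):
    """Factoring: batches of P equal chunks, each batch covering
    about half of the remaining iterations, i.e. chunk = ceil(R / 2P)."""
    chunks, remaining = [], N
    while remaining > 0:
        c = max(1, math.ceil(remaining / (2 * P)))
        for _ in range(P):
            if remaining == 0:
                break
            take = min(c, remaining)
            chunks.append(take)
            remaining -= take
    return chunks

def tss_chunks(N, P):
    """Trapezoid self-scheduling: chunk sizes decrease linearly
    from f = ceil(N / 2P) down to l = 1."""
    f, l = math.ceil(N / (2 * P)), 1
    n = math.ceil(2 * N / (f + l))            # planned number of chunks
    d = (f - l) / (n - 1) if n > 1 else 0     # linear decrement per chunk
    chunks, remaining, size = [], N, float(f)
    while remaining > 0:
        take = min(max(1, round(size)), remaining)
        chunks.append(take)
        remaining -= take
        size -= d
    return chunks

print(fac_chunks(100, 4))  # batches of 4 equal chunks: 13,13,13,13, 6,6,6,6, ...
print(tss_chunks(100, 4))  # linearly decreasing: 13, 12, 11, 10, ...
```

Both schemes front-load large chunks to keep scheduling overhead low and finish with small chunks to smooth out load imbalance, which is exactly the trade-off the abstract refers to.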


Footnotes
1
Tasks and loop iterations denote independent units of computation and are used interchangeably in this work.
 
2
Task denotes, in this context, the form of concurrency at the level of a programming paradigm, e.g., OpenMP 3.0 tasks.
 
3
For simplicity and consistency with prior work, PSS is herein denoted SS.
 
4
Here chunk denotes the smallest chunk size to be scheduled.
 
5
Here chunk denotes a fixed chunk size to be scheduled.
 
7
In non-adaptive dynamic scheduling, the schedule does not adapt to runtime performance observations other than to the dynamic consumption of tasks.
 
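Footnotes 4 and 5 draw a distinction that is easy to miss: for guided-style schedules the chunk parameter is a lower bound on chunk size, while for dynamic scheduling it is a fixed chunk size. A small sketch of the resulting chunk sequences (assuming the common chunk rule ceil(R/P) for guided, where R is the remaining iteration count; an illustration, not a specific runtime's implementation):

```python
import math

def guided_chunks(N, P, k=1):
    """schedule(guided, k): chunk = ceil(remaining / P),
    but never smaller than the minimum chunk size k."""
    chunks, remaining = [], N
    while remaining > 0:
        c = min(remaining, max(k, math.ceil(remaining / P)))
        chunks.append(c)
        remaining -= c
    return chunks

def dynamic_chunks(N, k):
    """schedule(dynamic, k): a fixed chunk size k, handed out
    on demand until the iteration space is exhausted."""
    chunks, remaining = [], N
    while remaining > 0:
        c = min(k, remaining)
        chunks.append(c)
        remaining -= c
    return chunks

print(guided_chunks(100, 4, k=4))   # decreasing, bounded below by 4
print(dynamic_chunks(100, 8))       # constant 8s, with a final remainder
```

The guided sequence shrinks toward its minimum k, whereas the dynamic sequence stays constant; this is the difference between "smallest chunk size to be scheduled" (footnote 4) and "fixed chunk size to be scheduled" (footnote 5).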
Metadata
Title
OpenMP Loop Scheduling Revisited: Making a Case for More Schedules
Authors
Florina M. Ciorba
Christian Iwainsky
Patrick Buder
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-319-98521-3_2