ABSTRACT
We present GLB, a programming model and an associated implementation that can handle a wide range of irregular parallel programming problems running over large-scale distributed systems. GLB is applicable both to problems that are easily load-balanced via static scheduling and to problems that are hard to statically load balance. GLB hides the intricate synchronizations (e.g., inter-node communication, initialization and startup, load balancing, termination and result collection) from the users. GLB internally uses a version of the lifeline graph based work-stealing algorithm proposed by Saraswat et al [25]. Users of GLB are simply required to write several pieces of sequential code that comply with the GLB interface. GLB then schedules and orchestrates the parallel execution of the code correctly and efficiently at scale.
We have applied GLB to two representative benchmarks: Betweenness Centrality (BC) and Unbalanced Tree Search (UTS). Among them, BC can be statically load-balanced whereas UTS cannot. In either case, GLB scales well -- achieving nearly linear speedup on different computer architectures (Power, Blue Gene/Q, and K) -- up to 16K cores.
- D. A. Bader, J. Feo, J. Gilbert, J. Kepner, D. Koester, E. Loh, K. Madduri, B. Mann, and T. Meuse. HPCS Scalable Synthetic Compact Applications#2: Graph Analysis. http://www.graphanalysis.org/benchmark/HPCS-SSCA2_Graph-Theory_v2.2.pdf%, 2007.Google Scholar
- J. E. Baldeschwieler, R. D. Blumofe, and E. A. Brewer. Atlas: an infrastructure for global computing. In EW 7: Proceedings of the 7th workshop on ACM SIGOPS European workshop, pages 165--172, New York, NY, USA, 1996. ACM. Google ScholarDigital Library
- R. Batoukov and T. Sorevik. A Generic Parallel Branch and Bound Environment on a Network of Workstations. In HiPer '99: Proceedings of High Performance Computing on Hewlett-Packard Systems, pages 474--483, 1999.Google Scholar
- R. D. Blumofe and C. E. Leiserson. Scheduling Multithreaded Computations by Work Stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS), pages 356--368, 1994. Google ScholarDigital Library
- R. D. Blumofe and P. A. Lisiecki. Adaptive and reliable parallel computing on networks of workstations. In ATEC '97: Proceedings of the annual conference on USENIX Annual Technical Conference, pages 10--10, Berkeley, CA, USA, 1997. USENIX Association. Google ScholarDigital Library
- G. Cong, S. Kodali, S. Krishnamoorthy, D. Lea, V. Saraswat, and T. Wen. Solving Large, Irregular Graph Problems Using Adaptive Work-Stealing. In ICPP '08: Proceedings of the 2008 37th International Conference on Parallel Processing, pages 536--545, Washington, DC, USA, 2008. IEEE Computer Society. Google ScholarDigital Library
- J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. OSDI'04, pages 10--10, Berkeley, CA, USA, 2004. USENIX Association. Google ScholarDigital Library
- J. Dinan, D. B. Larkins, P. Sadayappan, S. Krishnamoorthy, and J. Nieplocha. Scalable work stealing. In SC '09, pages 1--11, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
- M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In PLDI'98, pages 212--223, Montreal, Quebec, Canada, June 1998. Google ScholarDigital Library
- A. Grama and V. Kumar. State of the Art in Parallel Search Techniques for Discrete Optimization Problems. IEEE Trans. on Knowl. and Data Eng., 11(1):28--35, 1999. Google ScholarDigital Library
- Y. Guo, R. Barik, R. Raman, and V. Sarkar. Work-First and Help-First Scheduling Policies for Async-Finish Task Parallelism. In Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium, May 2009. Google ScholarDigital Library
- L. V. Kalé and S. Krishnan. CHARM++: A portable concurrent object oriented system based on C++. In OOPSLA'93, volume 28, pages 91--108, 1993. Google ScholarDigital Library
- V. Kumar, A. Y. Grama, and N. R. Vempaty. Scalable load balancing techniques for parallel computers. J. Parallel Distrib. Comput., 22(1):60--79, 1994. Google ScholarDigital Library
- Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. Skewtune: Mitigating skew in mapreduce applications. SIGMOD '12, pages 25--36, New York, NY, USA, 2012. ACM. Google ScholarDigital Library
- D. Lea. A Java fork/join framework. In Proceedings of the ACM 2000 conference on Java Grande, JAVA '00, pages 36--43, New York, NY, USA, 2000. ACM. Google ScholarDigital Library
- J. Mellor-Crummey, L. Adhianto, G. Jin, M. Krentel, K. Murthy, W. Scherer, and C. Yang. Class II Submission to the HPC Challenge Award Competition Coarray Fortran 2.0, Nov. 2011.Google Scholar
- K. Morton, A. Friesen, M. Balazinska, and D. Grossman. Estimating the progress of mapreduce pipelines. In ICDE'10, pages 681--684. IEEE, 2010.Google ScholarCross Ref
- S. Olivier and J. Prins. Scalable dynamic load balancing using UPC. In ICPP '08: Proceedings of the 2008 37th International Conference on Parallel Processing, pages 123--131, Washington, DC, USA, 2008. IEEE Computer Society. Google ScholarDigital Library
- OpenMP Architecture Review Board. OpenMP Application Program Interface, v3.0. May 2008.Google Scholar
- J. Prins, J. Huan, B. Pugh, C.-W. Tseng, and P. Sadayappan. UPC Implementation of an Unbalanced Tree Search Benchmark. Technical Report 03-034, University of North Carolina at Chapel Hill, October 2003.Google Scholar
- J. Reinders. Intel Threading Building Blocks. O'Reilly, 2007. Google ScholarDigital Library
- V. Saraswat, G. Almasi, G. Bikshandi, C. Cascaval, D. Cunningham, D. Grove, S. Kodali, I. Peshansky, and O. Tardieu. The Asynchronous Partitioned Global Address Space Model. In AMP'10: Proceedings of The First Workshop on Advances in Message Passing, June 2010.Google Scholar
- V. Saraswat, B. Bloom, I. Peshansky, O. Tardieu, and D. Grove. The X10 reference manual, v2.4. Sept. 2013.Google Scholar
- V. Saraswat and R. Jagadeesan. Concurrent clustered programming. In Concur'05, pages 353--367, 2005. Google ScholarDigital Library
- V. A. Saraswat, P. Kambadur, S. Kodali, D. Grove, and S. Krishnamoorthy. Lifeline-based global load balancing. PPoPP '11, pages 201--212, New York, NY, USA, 2011. ACM. Google ScholarDigital Library
- A. B. Sinha and L. V. Kalé. A load balancing strategy for prioritized execution of tasks. In IIPS'93: Proceedings of International Parallel Processing Symposium, pages 230--237, 1993. Google ScholarDigital Library
- O. Tardieu, B. Herta, D. Cunningham, D. Grove, P. Kambadur, V. A. Saraswat, A. Shinnar, M. Takeuchi, and M. Vaziri. X10 and APGAS at Petascale. In Proceedings of the 19th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP '14. ACM, 2014. Google ScholarDigital Library
- O. Tardieu, H. Wang, and H. Lin. A work-stealing scheduler for X10's task parallelism with suspension. In Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, PPoPP '12, pages 267--276, New York, NY, USA, 2012. ACM. Google ScholarDigital Library
- R. V. van Nieuwpoort, T. Kielmann, and H. E. Bal. Efficient load balancing for wide-area divide-and-conquer applications. In PPoPP'01, pages 34--43, New York, NY, USA, 2001. ACM. Google ScholarDigital Library
- C. Yang, K. Murthy, and J. Mellor-Crummey. Managing asynchronous operations in coarray fortran 2.0. In IPDPS'13, pages 1321--1332, 2013. Google ScholarDigital Library
Index Terms
- GLB: lifeline-based global load balancing library in x10
Recommendations
Tools-supported HPF and MPI parallelization of the NAS parallel benchmarks
FRONTIERS '96: Proceedings of the 6th Symposium on the Frontiers of Massively Parallel ComputationHigh Performance Fortran (HPF) compilers and communication libraries with the standardized Message Passing Interface (MPI) are becoming widely available, easing the development of portable parallel applications. The Annai tool environment supports ...
A case for distributed work-stealing in regular applications
X10 2016: Proceedings of the 6th ACM SIGPLAN Workshop on X10This paper presents a dynamically heterogeneous architecture use-case that is both realistic and favorable for distributed work-stealing in regular parallel applications. Using a straightforward implementation of distributed dense matrix multiplication ...
The Cilkview scalability analyzer
SPAA '10: Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architecturesThe Cilkview scalability analyzer is a software tool for profiling, estimating scalability, and benchmarking multithreaded Cilk++ applications. Cilkview monitors logical parallelism during an instrumented execution of the Cilk++ application on a single ...
Comments