research-article

GLB: lifeline-based global load balancing library in x10

Authors:
Wei Zhang

BM T. J. Watson Research Center, Yorktown Heights, USA

BM T. J. Watson Research Center, Yorktown Heights, USA
View Profile

,
Olivier Tardieu

IBM T. J. Watson Research Center, Yorktown Heights, USA

IBM T. J. Watson Research Center, Yorktown Heights, USA
View Profile

,
David Grove

IBM T. J. Watson Research Center, Yorktown Heights, USA

IBM T. J. Watson Research Center, Yorktown Heights, USA
View Profile

,
Benjamin Herta

IBM T. J. Watson Research Center, Yorktown Heights, USA

IBM T. J. Watson Research Center, Yorktown Heights, USA
View Profile

,
Tomio Kamada

Kobe University/RIKEN, Kobe, Japan

Kobe University/RIKEN, Kobe, Japan
View Profile

,
Vijay Saraswat

IBM T. J. Watson Research Center, Yorktown Heights, USA

IBM T. J. Watson Research Center, Yorktown Heights, USA
View Profile

,
Mikio Takeuchi

IBM Research - Tokyo, Tokyo, Japan

IBM Research - Tokyo, Tokyo, Japan
View Profile

PPAA '14: Proceedings of the first workshop on Parallel programming for analytics applicationsFebruary 2014Pages 31–40https://doi.org/10.1145/2567634.2567639

Published:16 February 2014Publication History

PPAA '14: Proceedings of the first workshop on Parallel programming for analytics applications

Pages 31–40

ABSTRACT

We present GLB, a programming model and an associated implementation that can handle a wide range of irregular parallel programming problems running over large-scale distributed systems. GLB is applicable both to problems that are easily load-balanced via static scheduling and to problems that are hard to statically load balance. GLB hides the intricate synchronizations (e.g., inter-node communication, initialization and startup, load balancing, termination and result collection) from the users. GLB internally uses a version of the lifeline graph based work-stealing algorithm proposed by Saraswat et al [25]. Users of GLB are simply required to write several pieces of sequential code that comply with the GLB interface. GLB then schedules and orchestrates the parallel execution of the code correctly and efficiently at scale.

We have applied GLB to two representative benchmarks: Betweenness Centrality (BC) and Unbalanced Tree Search (UTS). Among them, BC can be statically load-balanced whereas UTS cannot. In either case, GLB scales well -- achieving nearly linear speedup on different computer architectures (Power, Blue Gene/Q, and K) -- up to 16K cores.

References

D. A. Bader, J. Feo, J. Gilbert, J. Kepner, D. Koester, E. Loh, K. Madduri, B. Mann, and T. Meuse. HPCS Scalable Synthetic Compact Applications#2: Graph Analysis. http://www.graphanalysis.org/benchmark/HPCS-SSCA2_Graph-Theory_v2.2.pdf%, 2007.Google Scholar
J. E. Baldeschwieler, R. D. Blumofe, and E. A. Brewer. Atlas: an infrastructure for global computing. In EW 7: Proceedings of the 7th workshop on ACM SIGOPS European workshop, pages 165--172, New York, NY, USA, 1996. ACM. Google ScholarDigital Library
R. Batoukov and T. Sorevik. A Generic Parallel Branch and Bound Environment on a Network of Workstations. In HiPer '99: Proceedings of High Performance Computing on Hewlett-Packard Systems, pages 474--483, 1999.Google Scholar
R. D. Blumofe and C. E. Leiserson. Scheduling Multithreaded Computations by Work Stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS), pages 356--368, 1994. Google ScholarDigital Library
R. D. Blumofe and P. A. Lisiecki. Adaptive and reliable parallel computing on networks of workstations. In ATEC '97: Proceedings of the annual conference on USENIX Annual Technical Conference, pages 10--10, Berkeley, CA, USA, 1997. USENIX Association. Google ScholarDigital Library
G. Cong, S. Kodali, S. Krishnamoorthy, D. Lea, V. Saraswat, and T. Wen. Solving Large, Irregular Graph Problems Using Adaptive Work-Stealing. In ICPP '08: Proceedings of the 2008 37th International Conference on Parallel Processing, pages 536--545, Washington, DC, USA, 2008. IEEE Computer Society. Google ScholarDigital Library
J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. OSDI'04, pages 10--10, Berkeley, CA, USA, 2004. USENIX Association. Google ScholarDigital Library
J. Dinan, D. B. Larkins, P. Sadayappan, S. Krishnamoorthy, and J. Nieplocha. Scalable work stealing. In SC '09, pages 1--11, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In PLDI'98, pages 212--223, Montreal, Quebec, Canada, June 1998. Google ScholarDigital Library
A. Grama and V. Kumar. State of the Art in Parallel Search Techniques for Discrete Optimization Problems. IEEE Trans. on Knowl. and Data Eng., 11(1):28--35, 1999. Google ScholarDigital Library
Y. Guo, R. Barik, R. Raman, and V. Sarkar. Work-First and Help-First Scheduling Policies for Async-Finish Task Parallelism. In Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium, May 2009. Google ScholarDigital Library
L. V. Kalé and S. Krishnan. CHARM++: A portable concurrent object oriented system based on C++. In OOPSLA'93, volume 28, pages 91--108, 1993. Google ScholarDigital Library
V. Kumar, A. Y. Grama, and N. R. Vempaty. Scalable load balancing techniques for parallel computers. J. Parallel Distrib. Comput., 22(1):60--79, 1994. Google ScholarDigital Library
Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. Skewtune: Mitigating skew in mapreduce applications. SIGMOD '12, pages 25--36, New York, NY, USA, 2012. ACM. Google ScholarDigital Library
D. Lea. A Java fork/join framework. In Proceedings of the ACM 2000 conference on Java Grande, JAVA '00, pages 36--43, New York, NY, USA, 2000. ACM. Google ScholarDigital Library
J. Mellor-Crummey, L. Adhianto, G. Jin, M. Krentel, K. Murthy, W. Scherer, and C. Yang. Class II Submission to the HPC Challenge Award Competition Coarray Fortran 2.0, Nov. 2011.Google Scholar
K. Morton, A. Friesen, M. Balazinska, and D. Grossman. Estimating the progress of mapreduce pipelines. In ICDE'10, pages 681--684. IEEE, 2010.Google ScholarCross Ref
S. Olivier and J. Prins. Scalable dynamic load balancing using UPC. In ICPP '08: Proceedings of the 2008 37th International Conference on Parallel Processing, pages 123--131, Washington, DC, USA, 2008. IEEE Computer Society. Google ScholarDigital Library
OpenMP Architecture Review Board. OpenMP Application Program Interface, v3.0. May 2008.Google Scholar
J. Prins, J. Huan, B. Pugh, C.-W. Tseng, and P. Sadayappan. UPC Implementation of an Unbalanced Tree Search Benchmark. Technical Report 03-034, University of North Carolina at Chapel Hill, October 2003.Google Scholar
J. Reinders. Intel Threading Building Blocks. O'Reilly, 2007. Google ScholarDigital Library
V. Saraswat, G. Almasi, G. Bikshandi, C. Cascaval, D. Cunningham, D. Grove, S. Kodali, I. Peshansky, and O. Tardieu. The Asynchronous Partitioned Global Address Space Model. In AMP'10: Proceedings of The First Workshop on Advances in Message Passing, June 2010.Google Scholar
V. Saraswat, B. Bloom, I. Peshansky, O. Tardieu, and D. Grove. The X10 reference manual, v2.4. Sept. 2013.Google Scholar
V. Saraswat and R. Jagadeesan. Concurrent clustered programming. In Concur'05, pages 353--367, 2005. Google ScholarDigital Library
V. A. Saraswat, P. Kambadur, S. Kodali, D. Grove, and S. Krishnamoorthy. Lifeline-based global load balancing. PPoPP '11, pages 201--212, New York, NY, USA, 2011. ACM. Google ScholarDigital Library
A. B. Sinha and L. V. Kalé. A load balancing strategy for prioritized execution of tasks. In IIPS'93: Proceedings of International Parallel Processing Symposium, pages 230--237, 1993. Google ScholarDigital Library
O. Tardieu, B. Herta, D. Cunningham, D. Grove, P. Kambadur, V. A. Saraswat, A. Shinnar, M. Takeuchi, and M. Vaziri. X10 and APGAS at Petascale. In Proceedings of the 19th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP '14. ACM, 2014. Google ScholarDigital Library
O. Tardieu, H. Wang, and H. Lin. A work-stealing scheduler for X10's task parallelism with suspension. In Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, PPoPP '12, pages 267--276, New York, NY, USA, 2012. ACM. Google ScholarDigital Library
R. V. van Nieuwpoort, T. Kielmann, and H. E. Bal. Efficient load balancing for wide-area divide-and-conquer applications. In PPoPP'01, pages 34--43, New York, NY, USA, 2001. ACM. Google ScholarDigital Library
C. Yang, K. Murthy, and J. Mellor-Crummey. Managing asynchronous operations in coarray fortran 2.0. In IPDPS'13, pages 1321--1332, 2013. Google ScholarDigital Library

Index Terms

GLB: lifeline-based global load balancing library in x10
1. Computing methodologies
  1. Distributed computing methodologies
    1. Distributed programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language features
        Concurrent programming structures
        Control structures
      2. Language types
        Distributed programming languages

Recommendations

Tools-supported HPF and MPI parallelization of the NAS parallel benchmarks
FRONTIERS '96: Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation

High Performance Fortran (HPF) compilers and communication libraries with the standardized Message Passing Interface (MPI) are becoming widely available, easing the development of portable parallel applications. The Annai tool environment supports ...
Read More
A case for distributed work-stealing in regular applications
X10 2016: Proceedings of the 6th ACM SIGPLAN Workshop on X10

This paper presents a dynamically heterogeneous architecture use-case that is both realistic and favorable for distributed work-stealing in regular parallel applications. Using a straightforward implementation of distributed dense matrix multiplication ...
Read More
The Cilkview scalability analyzer
SPAA '10: Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures

The Cilkview scalability analyzer is a software tool for profiling, estimating scalability, and benchmarking multithreaded Cilk++ applications. Cilkview monitors logical parallelism during an instrumented execution of the Cilk++ application on a single ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
PPAA '14: Proceedings of the first workshop on Parallel programming for analytics applications
February 2014
72 pages
ISBN:9781450326544
DOI:10.1145/2567634
General Chairs:
Manoj Kumar
IBM Research, USA
,
Joefon Jann
IBM Research, USA
,
Program Chair:
Priya Nagpurkar
IBM Research, USA
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 16 February 2014
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
glb
performance
scalability
work-stealing
{x10
Qualifiers
- research-article
Conference

Acceptance Rates
PPAA '14 Paper Acceptance Rate6of7submissions,86%Overall Acceptance Rate6of7submissions,86%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 39
  Total Citations
  View Citations
- 195
  Total Downloads
- Downloads (Last 12 months)3
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

GLB: lifeline-based global load balancing library in x10

PPAA '14: Proceedings of the first workshop on Parallel programming for analytics applications

ABSTRACT

References

Cited By

Index Terms

Recommendations

Tools-supported HPF and MPI parallelization of the NAS parallel benchmarks

A case for distributed work-stealing in regular applications

The Cilkview scalability analyzer

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

GLB: lifeline-based global load balancing library in x10

PPAA '14: Proceedings of the first workshop on Parallel programming for analytics applications

ABSTRACT

References

Cited By

Index Terms

Recommendations

Tools-supported HPF and MPI parallelization of the NAS parallel benchmarks

A case for distributed work-stealing in regular applications

The Cilkview scalability analyzer

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media