research-article

Scheduling parallel programs by work stealing with private deques

Authors:
Umut A. Acar

Carnegie Mellon University, Pittsburgh, USA

Carnegie Mellon University, Pittsburgh, USA
View Profile

,
Arthur Chargueraud

Inria Saclay, Paris, France

Inria Saclay, Paris, France
View Profile

,
Mike Rainey

Max Planck Institute for Software Systems, Kaiserslautern, Germany

Max Planck Institute for Software Systems, Kaiserslautern, Germany
View Profile

PPoPP '13: Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programmingFebruary 2013Pages 219–228https://doi.org/10.1145/2442516.2442538

Published:23 February 2013Publication History

PPoPP '13: Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming

Pages 219–228

ABSTRACT

Work stealing has proven to be an effective method for scheduling parallel programs on multicore computers. To achieve high performance, work stealing distributes tasks between concurrent queues, called deques, which are assigned to each processor. Each processor operates on its deque locally except when performing load balancing via steals. Unfortunately, concurrent deques suffer from two limitations: 1) local deque operations require expensive memory fences in modern weak-memory architectures, 2) they can be very difficult to extend to support various optimizations and flexible forms of task distribution strategies needed many applications, e.g., those that do not fit nicely into the divide-and-conquer, nested data parallel paradigm.

For these reasons, there has been a lot recent interest in implementations of work stealing with non-concurrent deques, where deques remain entirely private to each processor and load balancing is performed via message passing. Private deques eliminate the need for memory fences from local operations and enable the design and implementation of efficient techniques for reducing task-creation overheads and improving task distribution. These advantages, however, come at the cost of communication. It is not known whether work stealing with private deques enjoys the theoretical guarantees of concurrent deques and whether they can be effective in practice.

In this paper, we propose two work-stealing algorithms with private deques and prove that the algorithms guarantee similar theoretical bounds as work stealing with concurrent deques. For the analysis, we use a probabilistic model and consider a new parameter, the branching depth of the computation. We present an implementation of the algorithm as a C++ library and show that it compares well to Cilk on a range of benchmarks. Since our approach relies on private deques, it enables implementing flexible task creation and distribution strategies. As a specific example, we show how to implement task coalescing and steal-half strategies, which can be important in fine-grain, non-divide-and-conquer algorithms such as graph algorithms, and apply them to the depth-first-search problem.

References

Umut A. Acar, Arthur Charguéraud, and Mike Rainey. Technical report associated with the present paper. http://arthur.chargueraud.org/research/2013/ppopp/full.pdfGoogle Scholar
Umut A. Acar, Guy E. Blelloch, and Robert D. Blumofe. The data locality of work stealing. Theory of Computing Systems (TOCS), 35(3):321--347, 2002.Google Scholar
Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In SPAA '98, pages 119--129. ACM Press, 1998. Google ScholarDigital Library
Petra Berenbrink, Tom Friedetzky, and Leslie Ann Goldberg. The natural work-stealing algorithm is stable. SIAM J. Comput., 32:1260--1279, May 2003. Google ScholarDigital Library
Guy E. Blelloch, Rezaul A. Chowdhury, Phillip B. Gibbons, Vijaya Ramachandran, Shimin Chen, and Michael Kozuch. Provably good multicore cache performance for divide-and-conquer algorithms. In In the Proceedings of the 19th ACM-SIAM Symposium on Discrete Algorithms, pages 501--510, 2008. Google ScholarDigital Library
Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, and Julian Shun. Internally deterministic parallel algorithms can be fast. In PPoPP '12, pages 181--192, NY, USA, 2012. ACM. Google ScholarDigital Library
Guy E. Blelloch and John Greiner. A provable time and space efficient implementation of NESL. In ICFP '96, pages 213--225. ACM, 1996. Google ScholarDigital Library
R.D. Blumofe and C.E. Leiserson. Scheduling multithreaded computations by work stealing. Foundations of Computer Science, IEEE Annual Symposium on, 0:356--368, 1994. Google ScholarDigital Library
Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: an efficient multithreaded runtime system. In PPoPP, pages 207--216, 1995. Google ScholarDigital Library
Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. J. ACM, 46:720--748, September 1999. Google ScholarDigital Library
F. Warren Burton and M. Ronan Sleep. Executing functional programs on a virtual tree of processors. In Functional Programming Languages and Computer Architecture (FPCA '81), pages 187--194. ACM Press, October 1981. Google ScholarDigital Library
David Chase and Yossi Lev. Dynamic circular work-stealing deque. In SPAA '05, pages 21--28, 2005. Google ScholarDigital Library
David Chase and Yossi Lev. Dynamic circular work-stealing deque. Technical report, Sun Microsystems, 2005.Google Scholar
Guojing Cong, Sreedhar B. Kodali, Sriram Krishnamoorthy, Doug Lea, Vijay A. Saraswat, and Tong Wen. Solving large, irregular graph problems using adaptive work-stealing. In ICPP, pages 536--545, 2008. Google ScholarDigital Library
Sivarama P. Dandamudi. The effect of scheduling discipline on dynamic load sharing in heterogeneous distributed systems. Modeling, Analysis, and Simulation of Computer Systems, International Symposium on, 0:17, 1997. Google ScholarDigital Library
J. Dinan, S. Olivier, G. Sabin, J. Prins, P. Sadayappan, and C.-W. Tseng. Dynamic load balancing of unbalanced computations using message passing. In IPDPS '07. IEEE International, march 2007.Google ScholarCross Ref
James Dinan, D. Brian Larkins, P. Sadayappan, Sriram Krishnamoorthy, and Jarek Nieplocha. Scalable work stealing. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pages 53:1--53:11. ACM, 2009. Google ScholarDigital Library
Derek L. Eager, Edward D. Lazowska, and John Zahorjan. A comparison of receiver-initiated and sender-initiated adaptive load sharing. Perform. Eval., 6(1):53--68, 1986. Google ScholarDigital Library
Marc Feeley. A message passing implementation of lazy task creation. In Parallel Symbolic Computing, pages 94--107, 1992. Google ScholarDigital Library
Marc Feeley. An efficient and general implementation of futures on large scale shared-memory multiprocessors. PhD thesis, Brandeis University, MA, USA, 1993. UMI Order No. GAX93--22348. Google ScholarDigital Library
Marc Feeley. Polling efficiently on stock hardware. In Proceedings of the conference on Functional programming languages and computer architecture, FPCA '93, pages 179--187, NY, USA, 1993. ACM. Google ScholarDigital Library
Matthew Fluet, Mike Rainey, John Reppy, and Adam Shaw. Implicitly threaded parallelism in Manticore. Journal of Functional Programming, 20(5--6):1--40, 2011. Google ScholarDigital Library
Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the Cilk-5 multithreaded language. In PLDI, pages 212--223, 1998. Google ScholarDigital Library
David Grove, Olivier Tardieu, David Cunningham, Ben Herta, Igor Peshansky, and Vijay Saraswat. A performance model for x10 applications: what's going on under the hood? In Proceedings of the 2011 ACM SIGPLAN X10 Workshop, pages 1:1--1:8, NY, USA, 2011. ACM. Google ScholarDigital Library
Robert H. Halstead, Jr. Implementation of multilisp: Lisp on a multiprocessor. In Proceedings of the 1984 ACM Symposium on LISP and functional programming, LFP '84, pages 9--17. ACM, 1984. Google ScholarDigital Library
Danny Hendler, Yossi Lev, Mark Moir, and Nir Shavit. A dynamic-sized nonblocking work stealing deque. Distrib. Comput., 18:189--207, February 2006. Google ScholarDigital Library
Danny Hendler and Nir Shavit. Non-blocking steal-half work queues. In PODC, pages 280--289, 2002. Google ScholarDigital Library
Danny Hendler and Nir Shavit. Work dealing. In SPAA '02, pages 164--172. ACM, 2002. Google ScholarDigital Library
Tasuku Hiraishi, Masahiro Yasugi, Seiji Umatani, and Taiichi Yuasa. Backtracking-based load balancing. In PPoPP '09, pages 55--64. ACM, 2009. Google ScholarDigital Library
Intel. Cilk Plus. http://software.intel.com/en-us/articles/intel-cilk-plus/.Google Scholar
Intel. Intel Xeon Processor X7550. Specifications at http://ark.intel.com/products/46498/Intel-Xeon-Processor-X7550-(18M-Cache-2_00-GHz-6_40-GTs-Intel-QPI).Google Scholar
Gabriele Keller, Manuel M.T. Chakravarty, Roman Leshchinskiy, Simon Peyton Jones, and Ben Lippmeier. Regular, shape-polymorphic, parallel arrays in haskell. In ICFP '10, pages 261--272, 2010. Google ScholarDigital Library
Sanjeev Kumar, Christopher J. Hughes, and Anthony Nguyen. Carbon: architectural support for fine-grained parallelism on chip multiprocessors. SIGARCH Computer Architecture News, 35:162--173, June 2007. Google ScholarDigital Library
Maged M. Michael, Martin T. Vechev, and Vijay A. Saraswat. Idempotent work stealing. In PPoPP '09, pages 45--54, 2009. Google ScholarDigital Library
R. Mirchandaney, D. Towsley, and J.A. Stankovic. Analysis of the effects of delays on load sharing. Computers, IEEE Transactions on, 38(11):1513--1525, nov 1989. Google ScholarDigital Library
Michael Mitzenmacher. Analyses of load stealing models based on differential equations. In SPAA '98, pages 212--221, NY, USA, 1998. ACM. Google ScholarDigital Library
Larry Rudolph, Miriam Slivkin-Allalouf, and Eli Upfal. A simple load balancing scheme for task allocation in parallel machines. In SPAA '91, pages 237--245, NY, USA, 1991. ACM. Google ScholarDigital Library
Bratin Saha, Ali-Reza Adl-Tabatabai, Anwar Ghuloum, Mohan Rajagopalan, Richard L. Hudson, Leaf Petersen, Vijay Menon, Brian Murphy, Tatiana Shpeisman, Jesse Fang, Eric Sprangle, Anwar Rohillah, and Doug Carmean. Enabling scalability and performance in a large scale chip multiprocessor environment. Technical Report. Intel Corp., 2006.Google Scholar
Daniel Sanchez, Richard M. Yoo, and Christos Kozyrakis. Flexible architectural support for fine-grain scheduling. In ASPLOS '10, pages 311--322, NY, USA, 2010. ACM. Google ScholarDigital Library
Peter Sanders. Randomized receiver initiated load-balancing algorithms for tree-shaped computations. Comput. J., 45(5):561--573, 2002.Google ScholarCross Ref
Barry Tannenbaum. Miser - a dynamically loadable memory allocator for multi-threaded applications. Intel Software Network, 2009.Google Scholar
Marc Tchiboukdjian, Nicolas Gast, Denis Trystram, Jean-Louis Roch, and Julien Bernard. A tighter analysis of work stealing. In Algorithms and Computation - 21st International Symposium, ISAAC 2010, volume 6507 of LNCS, pages 291--302. Springer, 2010.Google Scholar
Alexandros Tzannes. Enhancing Productivity and Performance Portability of General-Purpose Parallel Programming. PhD thesis, University of Maryland, 2012.Google Scholar
Seiji Umatani, Masahiro Yasugi, Tsuneyasu Komiya, and Taiichi Yuasa. Pursuing laziness for efficient implementation of modern multithreaded languages. In ISHPC, pages 174--188, 2003.Google ScholarCross Ref

Index Terms

Scheduling parallel programs by work stealing with private deques
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Runtime environments

Recommendations

Scheduling parallel programs by work stealing with private deques
PPoPP '13

Work stealing has proven to be an effective method for scheduling parallel programs on multicore computers. To achieve high performance, work stealing distributes tasks between concurrent queues, called deques, which are assigned to each processor. Each ...
Read More
Dynamic circular work-stealing deque
SPAA '05: Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures

The non-blocking work-stealing algorithm of Arora, Blumofe, and Plaxton (henceforth ABP work-stealing) is on its way to becoming the multiprocessor load balancing technology of choice in both industry and academia. This highly efficient scheme is based ...
Read More
Efficient Work-Stealing with Blocking Deques
HPCC '14: Proceedings of the 2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS)

Work stealing is a popular and effective approach to implement load balancing in modern multi-/many-core systems, where each parallel thread has its local deque to maintain its own work-set of tasks and performs load balancing by stealing tasks from ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
PPoPP '13: Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
February 2013
332 pages
ISBN:9781450319225
DOI:10.1145/2442516
General Chairs:
Alex Nicolau
University of California, Irvine, USA
,
Xiaowei Shen
IBM Research, China
,
Program Chairs:
Saman Amarasinghe
Massachusetts Institute of Technology, USA
,
Richard Vuduc
Georgia Institute of Technology, USA
ACM SIGPLAN Notices Volume 48, Issue 8
PPoPP '13
August 2013
309 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2517327
Issue’s Table of Contents
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 23 February 2013
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
dynamic load balancing
nested parallelism
work stealing
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate230of1,014submissions,23%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 111
  Total Citations
  View Citations
- 933
  Total Downloads
- Downloads (Last 12 months)49
- Downloads (Last 6 weeks)10
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Scheduling parallel programs by work stealing with private deques

PPoPP '13: Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming

ABSTRACT

References

Cited By

Index Terms

Recommendations

Scheduling parallel programs by work stealing with private deques

Dynamic circular work-stealing deque

Efficient Work-Stealing with Blocking Deques