ABSTRACT
Work stealing has proven to be an effective method for scheduling parallel programs on multicore computers. To achieve high performance, work stealing distributes tasks between concurrent queues, called deques, which are assigned to each processor. Each processor operates on its deque locally except when performing load balancing via steals. Unfortunately, concurrent deques suffer from two limitations: 1) local deque operations require expensive memory fences in modern weak-memory architectures, 2) they can be very difficult to extend to support various optimizations and flexible forms of task distribution strategies needed many applications, e.g., those that do not fit nicely into the divide-and-conquer, nested data parallel paradigm.
For these reasons, there has been a lot recent interest in implementations of work stealing with non-concurrent deques, where deques remain entirely private to each processor and load balancing is performed via message passing. Private deques eliminate the need for memory fences from local operations and enable the design and implementation of efficient techniques for reducing task-creation overheads and improving task distribution. These advantages, however, come at the cost of communication. It is not known whether work stealing with private deques enjoys the theoretical guarantees of concurrent deques and whether they can be effective in practice.
In this paper, we propose two work-stealing algorithms with private deques and prove that the algorithms guarantee similar theoretical bounds as work stealing with concurrent deques. For the analysis, we use a probabilistic model and consider a new parameter, the branching depth of the computation. We present an implementation of the algorithm as a C++ library and show that it compares well to Cilk on a range of benchmarks. Since our approach relies on private deques, it enables implementing flexible task creation and distribution strategies. As a specific example, we show how to implement task coalescing and steal-half strategies, which can be important in fine-grain, non-divide-and-conquer algorithms such as graph algorithms, and apply them to the depth-first-search problem.
- Umut A. Acar, Arthur Charguéraud, and Mike Rainey. Technical report associated with the present paper. http://arthur.chargueraud.org/research/2013/ppopp/full.pdfGoogle Scholar
- Umut A. Acar, Guy E. Blelloch, and Robert D. Blumofe. The data locality of work stealing. Theory of Computing Systems (TOCS), 35(3):321--347, 2002.Google Scholar
- Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In SPAA '98, pages 119--129. ACM Press, 1998. Google ScholarDigital Library
- Petra Berenbrink, Tom Friedetzky, and Leslie Ann Goldberg. The natural work-stealing algorithm is stable. SIAM J. Comput., 32:1260--1279, May 2003. Google ScholarDigital Library
- Guy E. Blelloch, Rezaul A. Chowdhury, Phillip B. Gibbons, Vijaya Ramachandran, Shimin Chen, and Michael Kozuch. Provably good multicore cache performance for divide-and-conquer algorithms. In In the Proceedings of the 19th ACM-SIAM Symposium on Discrete Algorithms, pages 501--510, 2008. Google ScholarDigital Library
- Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, and Julian Shun. Internally deterministic parallel algorithms can be fast. In PPoPP '12, pages 181--192, NY, USA, 2012. ACM. Google ScholarDigital Library
- Guy E. Blelloch and John Greiner. A provable time and space efficient implementation of NESL. In ICFP '96, pages 213--225. ACM, 1996. Google ScholarDigital Library
- R.D. Blumofe and C.E. Leiserson. Scheduling multithreaded computations by work stealing. Foundations of Computer Science, IEEE Annual Symposium on, 0:356--368, 1994. Google ScholarDigital Library
- Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: an efficient multithreaded runtime system. In PPoPP, pages 207--216, 1995. Google ScholarDigital Library
- Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. J. ACM, 46:720--748, September 1999. Google ScholarDigital Library
- F. Warren Burton and M. Ronan Sleep. Executing functional programs on a virtual tree of processors. In Functional Programming Languages and Computer Architecture (FPCA '81), pages 187--194. ACM Press, October 1981. Google ScholarDigital Library
- David Chase and Yossi Lev. Dynamic circular work-stealing deque. In SPAA '05, pages 21--28, 2005. Google ScholarDigital Library
- David Chase and Yossi Lev. Dynamic circular work-stealing deque. Technical report, Sun Microsystems, 2005.Google Scholar
- Guojing Cong, Sreedhar B. Kodali, Sriram Krishnamoorthy, Doug Lea, Vijay A. Saraswat, and Tong Wen. Solving large, irregular graph problems using adaptive work-stealing. In ICPP, pages 536--545, 2008. Google ScholarDigital Library
- Sivarama P. Dandamudi. The effect of scheduling discipline on dynamic load sharing in heterogeneous distributed systems. Modeling, Analysis, and Simulation of Computer Systems, International Symposium on, 0:17, 1997. Google ScholarDigital Library
- J. Dinan, S. Olivier, G. Sabin, J. Prins, P. Sadayappan, and C.-W. Tseng. Dynamic load balancing of unbalanced computations using message passing. In IPDPS '07. IEEE International, march 2007.Google ScholarCross Ref
- James Dinan, D. Brian Larkins, P. Sadayappan, Sriram Krishnamoorthy, and Jarek Nieplocha. Scalable work stealing. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pages 53:1--53:11. ACM, 2009. Google ScholarDigital Library
- Derek L. Eager, Edward D. Lazowska, and John Zahorjan. A comparison of receiver-initiated and sender-initiated adaptive load sharing. Perform. Eval., 6(1):53--68, 1986. Google ScholarDigital Library
- Marc Feeley. A message passing implementation of lazy task creation. In Parallel Symbolic Computing, pages 94--107, 1992. Google ScholarDigital Library
- Marc Feeley. An efficient and general implementation of futures on large scale shared-memory multiprocessors. PhD thesis, Brandeis University, MA, USA, 1993. UMI Order No. GAX93--22348. Google ScholarDigital Library
- Marc Feeley. Polling efficiently on stock hardware. In Proceedings of the conference on Functional programming languages and computer architecture, FPCA '93, pages 179--187, NY, USA, 1993. ACM. Google ScholarDigital Library
- Matthew Fluet, Mike Rainey, John Reppy, and Adam Shaw. Implicitly threaded parallelism in Manticore. Journal of Functional Programming, 20(5--6):1--40, 2011. Google ScholarDigital Library
- Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the Cilk-5 multithreaded language. In PLDI, pages 212--223, 1998. Google ScholarDigital Library
- David Grove, Olivier Tardieu, David Cunningham, Ben Herta, Igor Peshansky, and Vijay Saraswat. A performance model for x10 applications: what's going on under the hood? In Proceedings of the 2011 ACM SIGPLAN X10 Workshop, pages 1:1--1:8, NY, USA, 2011. ACM. Google ScholarDigital Library
- Robert H. Halstead, Jr. Implementation of multilisp: Lisp on a multiprocessor. In Proceedings of the 1984 ACM Symposium on LISP and functional programming, LFP '84, pages 9--17. ACM, 1984. Google ScholarDigital Library
- Danny Hendler, Yossi Lev, Mark Moir, and Nir Shavit. A dynamic-sized nonblocking work stealing deque. Distrib. Comput., 18:189--207, February 2006. Google ScholarDigital Library
- Danny Hendler and Nir Shavit. Non-blocking steal-half work queues. In PODC, pages 280--289, 2002. Google ScholarDigital Library
- Danny Hendler and Nir Shavit. Work dealing. In SPAA '02, pages 164--172. ACM, 2002. Google ScholarDigital Library
- Tasuku Hiraishi, Masahiro Yasugi, Seiji Umatani, and Taiichi Yuasa. Backtracking-based load balancing. In PPoPP '09, pages 55--64. ACM, 2009. Google ScholarDigital Library
- Intel. Cilk Plus. http://software.intel.com/en-us/articles/intel-cilk-plus/.Google Scholar
- Intel. Intel Xeon Processor X7550. Specifications at http://ark.intel.com/products/46498/Intel-Xeon-Processor-X7550-(18M-Cache-2_00-GHz-6_40-GTs-Intel-QPI).Google Scholar
- Gabriele Keller, Manuel M.T. Chakravarty, Roman Leshchinskiy, Simon Peyton Jones, and Ben Lippmeier. Regular, shape-polymorphic, parallel arrays in haskell. In ICFP '10, pages 261--272, 2010. Google ScholarDigital Library
- Sanjeev Kumar, Christopher J. Hughes, and Anthony Nguyen. Carbon: architectural support for fine-grained parallelism on chip multiprocessors. SIGARCH Computer Architecture News, 35:162--173, June 2007. Google ScholarDigital Library
- Maged M. Michael, Martin T. Vechev, and Vijay A. Saraswat. Idempotent work stealing. In PPoPP '09, pages 45--54, 2009. Google ScholarDigital Library
- R. Mirchandaney, D. Towsley, and J.A. Stankovic. Analysis of the effects of delays on load sharing. Computers, IEEE Transactions on, 38(11):1513--1525, nov 1989. Google ScholarDigital Library
- Michael Mitzenmacher. Analyses of load stealing models based on differential equations. In SPAA '98, pages 212--221, NY, USA, 1998. ACM. Google ScholarDigital Library
- Larry Rudolph, Miriam Slivkin-Allalouf, and Eli Upfal. A simple load balancing scheme for task allocation in parallel machines. In SPAA '91, pages 237--245, NY, USA, 1991. ACM. Google ScholarDigital Library
- Bratin Saha, Ali-Reza Adl-Tabatabai, Anwar Ghuloum, Mohan Rajagopalan, Richard L. Hudson, Leaf Petersen, Vijay Menon, Brian Murphy, Tatiana Shpeisman, Jesse Fang, Eric Sprangle, Anwar Rohillah, and Doug Carmean. Enabling scalability and performance in a large scale chip multiprocessor environment. Technical Report. Intel Corp., 2006.Google Scholar
- Daniel Sanchez, Richard M. Yoo, and Christos Kozyrakis. Flexible architectural support for fine-grain scheduling. In ASPLOS '10, pages 311--322, NY, USA, 2010. ACM. Google ScholarDigital Library
- Peter Sanders. Randomized receiver initiated load-balancing algorithms for tree-shaped computations. Comput. J., 45(5):561--573, 2002.Google ScholarCross Ref
- Barry Tannenbaum. Miser - a dynamically loadable memory allocator for multi-threaded applications. Intel Software Network, 2009.Google Scholar
- Marc Tchiboukdjian, Nicolas Gast, Denis Trystram, Jean-Louis Roch, and Julien Bernard. A tighter analysis of work stealing. In Algorithms and Computation - 21st International Symposium, ISAAC 2010, volume 6507 of LNCS, pages 291--302. Springer, 2010.Google Scholar
- Alexandros Tzannes. Enhancing Productivity and Performance Portability of General-Purpose Parallel Programming. PhD thesis, University of Maryland, 2012.Google Scholar
- Seiji Umatani, Masahiro Yasugi, Tsuneyasu Komiya, and Taiichi Yuasa. Pursuing laziness for efficient implementation of modern multithreaded languages. In ISHPC, pages 174--188, 2003.Google ScholarCross Ref
Index Terms
- Scheduling parallel programs by work stealing with private deques
Recommendations
Scheduling parallel programs by work stealing with private deques
PPoPP '13Work stealing has proven to be an effective method for scheduling parallel programs on multicore computers. To achieve high performance, work stealing distributes tasks between concurrent queues, called deques, which are assigned to each processor. Each ...
Dynamic circular work-stealing deque
SPAA '05: Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architecturesThe non-blocking work-stealing algorithm of Arora, Blumofe, and Plaxton (henceforth ABP work-stealing) is on its way to becoming the multiprocessor load balancing technology of choice in both industry and academia. This highly efficient scheme is based ...
Efficient Work-Stealing with Blocking Deques
HPCC '14: Proceedings of the 2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS)Work stealing is a popular and effective approach to implement load balancing in modern multi-/many-core systems, where each parallel thread has its local deque to maintain its own work-set of tasks and performs load balancing by stealing tasks from ...
Comments