Research article
DOI: 10.1145/2442516.2442538

Scheduling parallel programs by work stealing with private deques

Published: 23 February 2013

ABSTRACT

Work stealing has proven to be an effective method for scheduling parallel programs on multicore computers. To achieve high performance, work stealing distributes tasks between concurrent queues, called deques, one assigned to each processor. Each processor operates on its deque locally except when performing load balancing via steals. Unfortunately, concurrent deques suffer from two limitations: 1) local deque operations require expensive memory fences on modern weak-memory architectures, and 2) they can be difficult to extend to support the various optimizations and flexible task-distribution strategies needed by many applications, e.g., those that do not fit nicely into the divide-and-conquer, nested data-parallel paradigm.
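To make the classic setup concrete, below is a minimal C++ sketch of work stealing with concurrent deques; the names ConcurrentDeque, Task, and worker_loop are illustrative assumptions, not the paper's API. It protects each deque with a mutex purely for brevity; production schedulers use lock-free deques such as Chase-Lev, where even the owner's local pop needs a memory fence, which is exactly the overhead discussed above.

#include <atomic>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <random>
#include <vector>

using Task = std::function<void()>;

// One concurrent deque per processor: the owner pushes and pops at the
// bottom, thieves steal from the top.
struct ConcurrentDeque {
  std::mutex m;
  std::deque<Task> tasks;

  void push_bottom(Task t) {                 // owner only
    std::lock_guard<std::mutex> g(m);
    tasks.push_back(std::move(t));
  }
  std::optional<Task> pop_bottom() {         // owner only
    std::lock_guard<std::mutex> g(m);
    if (tasks.empty()) return std::nullopt;
    Task t = std::move(tasks.back());
    tasks.pop_back();
    return t;
  }
  std::optional<Task> steal_top() {          // called by thieves
    std::lock_guard<std::mutex> g(m);
    if (tasks.empty()) return std::nullopt;
    Task t = std::move(tasks.front());
    tasks.pop_front();
    return t;
  }
};

// Scheduler loop: work locally; when the local deque runs dry, try to
// steal from a randomly chosen victim.
void worker_loop(int my_id, std::vector<ConcurrentDeque>& deques,
                 const std::atomic<bool>& done) {
  std::mt19937 rng(my_id);
  while (!done.load()) {
    if (auto t = deques[my_id].pop_bottom()) {
      (*t)();                                // run local work
    } else {
      int victim = static_cast<int>(rng() % deques.size());
      if (victim != my_id)
        if (auto t = deques[victim].steal_top()) (*t)();
    }
  }
}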

For these reasons, there has been considerable recent interest in implementations of work stealing with non-concurrent deques, where deques remain entirely private to each processor and load balancing is performed via message passing. Private deques eliminate the need for memory fences in local operations and enable the design and implementation of efficient techniques for reducing task-creation overheads and improving task distribution. These advantages, however, come at the cost of communication. It has not been known whether work stealing with private deques enjoys the theoretical guarantees of concurrent deques, nor whether it can be effective in practice.
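The following is a hedged sketch, not the paper's code, of one way to realize receiver-initiated load balancing over private deques in C++; the names Worker, communicate, and acquire are hypothetical. An idle worker posts a steal request into the victim's mailbox, and the victim polls its mailbox at task boundaries and replies with a task or with nothing. Local pushes and pops touch only unshared data and therefore need no memory fences.

#include <atomic>
#include <deque>
#include <functional>
#include <vector>

using Task = std::function<void()>;
constexpr int NO_REQUEST = -1;

struct Worker {
  std::deque<Task> deque;                  // private: accessed only by its owner
  std::atomic<int> incoming{NO_REQUEST};   // id of a thief requesting work, if any
  std::atomic<Task*> reply{nullptr};       // task handed back to this worker as a thief
  std::atomic<bool> replied{false};        // set once the victim has answered
};

// Victim side: called at task boundaries to serve at most one pending request.
void communicate(Worker& self, std::vector<Worker>& workers) {
  int thief = self.incoming.exchange(NO_REQUEST);
  if (thief == NO_REQUEST) return;
  Task* t = nullptr;
  if (!self.deque.empty()) {               // give away the oldest task, if any
    t = new Task(std::move(self.deque.front()));
    self.deque.pop_front();
  }
  workers[thief].reply.store(t);
  workers[thief].replied.store(true);
}

// Thief side: ask one victim and wait for its answer, which may be empty.
// The caller takes ownership of the returned task. A full scheduler must
// also bound the wait and handle victims that are themselves idle.
Task* acquire(int my_id, int victim, std::vector<Worker>& workers) {
  Worker& me = workers[my_id];
  me.replied.store(false);
  me.reply.store(nullptr);
  int expected = NO_REQUEST;
  if (!workers[victim].incoming.compare_exchange_strong(expected, my_id))
    return nullptr;                        // victim already has a pending request
  while (!me.replied.load()) { /* spin; the victim answers at its next poll */ }
  return me.reply.exchange(nullptr);       // may be null if the victim had no work
}

In this model the thief never touches the victim's deque directly; all transfers go through the atomic mailbox cells, which is what lets the owner's deque operations stay fence-free.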

In this paper, we propose two work-stealing algorithms with private deques and prove that they guarantee theoretical bounds similar to those of work stealing with concurrent deques. For the analysis, we use a probabilistic model and consider a new parameter, the branching depth of the computation. We present an implementation of the algorithm as a C++ library and show that it compares well to Cilk on a range of benchmarks. Since our approach relies on private deques, it enables flexible task-creation and task-distribution strategies. As a specific example, we show how to implement task coalescing and steal-half strategies, which can be important in fine-grain, non-divide-and-conquer algorithms such as graph algorithms, and apply them to the depth-first-search problem.
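Under the message-passing model sketched above, richer transfer policies reduce to plain operations on the victim's private deque. The snippet below is a small illustrative sketch of a steal-half reply; split_half is a hypothetical helper, not part of the paper's library. Task coalescing is similar in spirit: because no other processor ever reads the private deque, the owner can batch many fine-grained tasks, such as DFS node visits, into a single task before pushing it.

#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

using Task = std::function<void()>;

// Steal-half: move the oldest half of the victim's private deque into a
// batch that is then sent to the thief in a single reply message.
std::deque<Task> split_half(std::deque<Task>& victim_deque) {
  std::deque<Task> stolen;
  std::size_t n = victim_deque.size() / 2;
  for (std::size_t i = 0; i < n; ++i) {
    stolen.push_back(std::move(victim_deque.front()));
    victim_deque.pop_front();
  }
  return stolen;
}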

    • Published in

      PPoPP '13: Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
      February 2013, 332 pages
      ISBN: 9781450319225
      DOI: 10.1145/2442516

    • Also published in

      ACM SIGPLAN Notices, Volume 48, Issue 8 (PPoPP '13)
      August 2013, 309 pages
      ISSN: 0362-1340, EISSN: 1558-1160
      DOI: 10.1145/2517327

      Copyright © 2013 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 23 February 2013


      Qualifiers

      • research-article

      Acceptance Rates

      Overall acceptance rate: 230 of 1,014 submissions, 23%
