ABSTRACT
Work-stealing is a promising approach for effectively exploiting software parallelism on parallel hardware. A programmer who uses work-stealing explicitly identifies potential parallelism and the runtime then schedules work, keeping otherwise idle hardware busy while relieving overloaded hardware of its burden. Prior work has demonstrated that work-stealing is very effective in practice. However, work-stealing comes with a substantial overhead: as much as 2x to 12x slowdown over orthodox sequential code.
In this paper we identify the key sources of overhead in work-stealing schedulers and present two significant refinements to their implementation. We evaluate our work-stealing designs using a range of benchmarks, four different work-stealing implementations, including the popular fork-join framework, and a range of architectures. On these benchmarks, compared to orthodox sequential Java, our fastest design has an overhead of just 15%. By contrast, fork-join has a 2.3x overhead and the previous implementation of the system we use has an overhead of 4.1x. These results and our insight into the sources of overhead for work-stealing implementations give further hope to an already promising technique for exploiting increasingly available hardware parallelism.
- Intel® Cilk#8482; Plus sdk. URL http://software.intel.com/en-us/articles/intel-cilk-plus.Google Scholar
- V. Cavé, J. Zhao, J. Shirako, and V. Sarkar. Habanero-Java: the new adventures of old X10. In Proceedings of the 9th International Conference on Principles and Practice of Programming in Java, PPPJ '11, pages 51--61, New York, NY, USA, 2011. ACM. ISBN 978--1--4503-0935--6. 10.1145/2093157.2093165. Google ScholarDigital Library
- P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: an object-oriented approach to non-uniform cluster computing. In Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, OOPSLA '05, pages 519--538, New York, NY, USA, 2005. ACM. ISBN 1--59593-031-0. 10.1145/1094811.1094852. Google ScholarDigital Library
- G. Cong, S. Kodali, S. Krishnamoorthy, D. Lea, V. Saraswat, and T. Wen. Solving large, irregular graph problems using adaptive work-stealing. In Proceedings of the 2008 37th International Conference on Parallel Processing, ICPP '08, pages 536--545, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-0--7695--3374--2. 10.1109/ICPP.2008.88. Google ScholarDigital Library
- A. Duran, J. Corbalán, and E. Ayguadé. Evaluation of OpenMP task scheduling strategies. In Proceedings of the 4th international conference on OpenMP in a new era of parallelism, IWOMP'08, pages 100--110, Berlin, Heidelberg, 2008. Springer-Verlag. ISBN 3--540--79560-X, 978--3--540--79560--5. URL http://dl.acm.org/citation.cfm?id=1789826.1789838. Google ScholarDigital Library
- A. Duran, X. Teruel, R. Ferrer, X. Martorell, and E. Ayguade. Barcelona OpenMP tasks suite: A set of benchmarks targeting the exploitation of task parallelism in OpenMP. In Proceedings of the 2009 International Conference on Parallel Processing, ICPP '09, pages 124--131, Washington, DC, USA, 2009. IEEE Computer Society. ISBN 978-0--7695--3802-0. 10.1109/ICPP.2009.64. Google ScholarDigital Library
- S. J. Fink and F. Qian. Design, implementation and evaluation of adaptive recompilation with on-stack replacement. In Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization, CGO '03, pages 241--252, Washington, DC, USA, 2003. IEEE Computer Society. ISBN 0--7695--1913-X. URL http://dl.acm.org/citation.cfm?id=776261.776288. Google ScholarDigital Library
- M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation, PLDI '98, pages 212--223, New York, NY, USA, 1998. ACM. ISBN 0--89791--987--4. 10.1145/277650.277725. Google ScholarDigital Library
- Frigo, Prokop, Frigo, Leiserson, Prokop, Ramachandran, Dailey, Leiserson, Lyubashevskiy, Kushman, et al.}frigo1998cilkM. Frigo, H. Prokop, M. Frigo, C. Leiserson, H. Prokop, S. Ramachandran, D. Dailey, C. Leiserson, I. Lyubashevskiy, N. Kushman, et al. The Cilk project. Algorithms, 1998.Google Scholar
- U. Hölzle, C. Chambers, and D. Ungar. Debugging optimized code with dynamic deoptimization. In Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation, PLDI '92, pages 32--43, New York, NY, USA, 1992. ACM. ISBN 0--89791--475--9. 10.1145/143103.143114. Google ScholarDigital Library
- P. Kambadur, A. Gupta, A. Ghoting, H. Avron, and A. Lumsdaine. PFunc: modern task parallelism for modern high performance computing. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pages 43:1--43:11, New York, NY, USA, 2009. ACM. ISBN 978--1--60558--744--8. 10.1145/1654059.1654103. Google ScholarDigital Library
- D. Lea. A Java fork/join framework. In Proceedings of the ACM 2000 conference on Java Grande, JAVA '00, pages 36--43, New York, NY, USA, 2000. ACM. ISBN 1--58113--288--3. 10.1145/337449.337465. Google ScholarDigital Library
- D. Leijen, W. Schulte, and S. Burckhardt. The design of a task parallel library. In Proceedings of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications, OOPSLA '09, pages 227--242, New York, NY, USA, 2009. ACM. ISBN 978--1--60558--766-0. 10.1145/1640089.1640106. Google ScholarDigital Library
- H.-W. Loidl and K. Hammond. On the granularity of divide-and-conquer parallelism. In Proceedings of the 1995 International Conference on Functional Programming, FP'95, pages 135--144, Swinton, UK, UK, 1995. British Computer Society. URL http://dl.acm.org/citation.cfm?id=2227330.2227343. Google ScholarDigital Library
- J. Mellor-Crummey. Cilk+, parallel performance, and the cilk runtime system. URL http://www.clear.rice.edu/comp422/lecture-notes/comp422--2012-Lecture5-Cilk.pdf.Google Scholar
- MIT. The Cilk project. URL http://supertech.csail.mit.edu/cilk/index.html.Google Scholar
- E. Mohr, D. Kranz, Halstead, and J. R.H. Lazy task creation: A technique for increasing the granularity of parallel programs. Technical report, Cambridge, MA, USA, 1991. Google ScholarDigital Library
- J. Reinders. phIntel threading building blocks: outfitting C+ for multi-core processor parallelism. O'Reilly Media, Inc., 2007. Google ScholarDigital Library
- O. Tardieu, H. Wang, and H. Lin. A work-stealing scheduler for X10's task parallelism with suspension. In Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, PPoPP '12, pages 267--276, New York, NY, USA, 2012. ACM. ISBN 978--1--4503--1160--1. 10.1145/2145816.2145850. Google ScholarDigital Library
- L. Wang, H. Cui, Y. Duan, F. Lu, X. Feng, and P.-C. Yew. An adaptive task creation strategy for work-stealing scheduling. In Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization, CGO '10, pages 266--277, New York, NY, USA, 2010. ACM. ISBN 978--1--60558--635--9. 10.1145/1772954.1772992. Google ScholarDigital Library
- X. Yang, S. M. Blackburn, D. Frampton, J. B. Sartor, and K. S. McKinley. Why nothing matters: the impact of zeroing. In Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications, OOPSLA '11, pages 307--324, New York, NY, USA, 2011. ACM. ISBN 978--1--4503-0940-0. 10.1145/2048066.2048092. Google ScholarDigital Library
Index Terms
- Work-stealing without the baggage
Recommendations
Friendly barriers: efficient work-stealing with return barriers
VEE '14: Proceedings of the 10th ACM SIGPLAN/SIGOPS international conference on Virtual execution environmentsThis paper addresses the problem of efficiently supporting parallelism within a managed runtime. A popular approach for exploiting software parallelism on parallel hardware is task parallelism, where the programmer explicitly identifies potential ...
Work-stealing without the baggage
OOPSLA '12Work-stealing is a promising approach for effectively exploiting software parallelism on parallel hardware. A programmer who uses work-stealing explicitly identifies potential parallelism and the runtime then schedules work, keeping otherwise idle ...
A work-stealing scheduler for X10's task parallelism with suspension
PPoPP '12: Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel ProgrammingThe X10 programming language is intended to ease the programming of scalable concurrent and distributed applications. X10 augments a familiar imperative object-oriented programming model with constructs to support light-weight asynchronous tasks as well ...
Comments