research-article

Friendly barriers: efficient work-stealing with return barriers

Authors:
Vivek Kumar

Australian National University, Canberra, Australia

Australian National University, Canberra, Australia
View Profile

,
Stephen M. Blackburn

Australian National University, Canberra, Australia

Australian National University, Canberra, Australia
View Profile

,
David Grove

IBM T.J. Watson Research, New York, NY, USA

IBM T.J. Watson Research, New York, NY, USA
View Profile

VEE '14: Proceedings of the 10th ACM SIGPLAN/SIGOPS international conference on Virtual execution environmentsMarch 2014Pages 165–176https://doi.org/10.1145/2576195.2576207

Published:01 March 2014Publication History

VEE '14: Proceedings of the 10th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments

Pages 165–176

ABSTRACT

This paper addresses the problem of efficiently supporting parallelism within a managed runtime. A popular approach for exploiting software parallelism on parallel hardware is task parallelism, where the programmer explicitly identifies potential parallelism and the runtime then schedules the work. Work-stealing is a promising scheduling strategy that a runtime may use to keep otherwise idle hardware busy while relieving overloaded hardware of its burden. However, work-stealing comes with substantial overheads. Recent work identified sequential overheads of work-stealing, those that occur even when no stealing takes place, as a significant source of overhead. That work was able to reduce sequential overheads to just 15%.

In this work, we turn to dynamic overheads, those that occur each time a steal takes place. We show that the dynamic overhead is dominated by introspection of the victim's stack when a steal takes place. We exploit the idea of a low overhead return barrier to reduce the dynamic overhead by approximately half, resulting in total performance improvements of as much as 20%. Because, unlike prior work, we attack the overheads directly due to stealing and therefore attack the overheads that grow as parallelism grows, we improve the scalability of work-stealing applications. This result is complementary to recent work addressing the sequential overheads of work-stealing. This work therefore substantially relieves work-stealing of the increasing pressure due to increasing intra-node hardware parallelism.

References

U. A. Acar, A. Chargueraud, and M. Rainey. Scheduling parallel programs by work stealing with private deques. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, pages 219--228, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-1922-5. 10.1145/2442516.2442538. Google ScholarDigital Library
B. Alpern, C. Attanasio, J. Barton, M. Burke, P. Cheng, J. Choi, A. Cocchi, S. Fink, D. Grove, M. Hind, et al. The Jalape no virtual machine. IBM Systems Journal, 39 (1): 211--238, 2010. ISSN 0018-8670. 10.1147/sj.391.0211. Google ScholarDigital Library
M. Arnold, S. Fink, D. Grove, M. Hind, and P. F. Sweeney. Adaptive optimization in the Jalape no JVM. In Proceedings of the 15th ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '00, pages 47--65, New York, NY, USA, 2000. ACM. ISBN 1-58113-200-X. 10.1145/353171.353175. Google ScholarDigital Library
N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '98, pages 119--129, New York, NY, USA, 1998. ACM. ISBN 0-89791-989-0. 10.1145/277651.277678. Google ScholarDigital Library
P. Berenbrink, T. Friedetzky, and L. A. Goldberg. The natural work-stealing algorithm is stable. SIAM J. Comput., 32 (5): 1260--1279, May 2003. ISSN 0097-5397. 10.1137/S0097539701399551. Google ScholarDigital Library
R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. J. ACM, 46 (5): 720--748, Sept. 1999. ISSN 0004-5411. 10.1145/324133.324234. Google ScholarDigital Library
Cavé, J. Zhao, J. Shirako, and V. Sarkar. Habanero-Java: the new adventures of old X10. In Proceedings of the 9th International Conference on Principles and Practice of Programming in Java, PPPJ '11, pages 51--61, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0935-6. 10.1145/2093157.2093165. Google ScholarDigital Library
P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: An object-oriented approach to non-uniform cluster computing. In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '05, pages 519--538, New York, NY, USA, 2005. ACM. ISBN 1-59593-031-0. 10.1145/1094811.1094852. Google ScholarDigital Library
G. Cong, S. Kodali, S. Krishnamoorthy, D. Lea, V. Saraswat, and T. Wen. Solving large, irregular graph problems using adaptive work-stealing. In Proceedings of the 2008 37th International Conference on Parallel Processing, ICPP '08, pages 536--545, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-0-7695-3374-2. 10.1109/ICPP.2008.88. Google ScholarDigital Library
J. Dinan, D. B. Larkins, P. Sadayappan, S. Krishnamoorthy, and J. Nieplocha. Scalable work stealing. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pages 53:1--53:11, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-744-8. 10.1145/1654059.1654113. Google ScholarDigital Library
S. J. Fink and F. Qian. Design, implementation and evaluation of adaptive recompilation with on-stack replacement. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, CGO '03, pages 241--252, Washington, DC, USA, 2003. IEEE Computer Society. ISBN 0-7695-1913-X. Google ScholarDigital Library
M. Frigo, H. Prokop, M. Frigo, C. Leiserson, H. Prokop, S. Ramachandran, D. Dailey, C. Leiserson, I. Lyubashevskiy, N. Kushman, et al. The Cilk project. Algorithms, 1998.Google Scholar
Y. Guo, R. Barik, R. Raman, and V. Sarkar. Work-first and help-first scheduling policies for async-finish task parallelism. In Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, IPDPS '09, pages 1--12, Washington, DC, USA, 2009. IEEE Computer Society. ISBN 978-1-4244-3751-1. 10.1109/IPDPS.2009.5161079. Google ScholarDigital Library
Y. Guo, J. Zhao, V. Cave, and V. Sarkar. SLAW: A scalable locality-aware adaptive work-stealing scheduler for multi-core systems. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '10, pages 341--342, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-877-3. 10.1145/1693453.1693504. Google ScholarDigital Library
D. Hendler and N. Shavit. Non-blocking steal-half work queues. In Proceedings of the Twenty-first Annual Symposium on Principles of Distributed Computing, PODC '02, pages 280--289, New York, NY, USA, 2002. ACM. ISBN 1-58113-485-1. 10.1145/571825.571876. Google ScholarDigital Library
R. Hoffmann and T. Rauber. Adaptive task pools: efficiently balancing large number of tasks on shared-address spaces. International Journal of Parallel Programming, 39 (5): 553--581, 2011.Google ScholarCross Ref
U. Hölzle, C. Chambers, and D. Ungar. Debugging optimized code with dynamic deoptimization. In Proceedings of the ACM SIGPLAN 1992 Conference on Programming Language Design and Implementation, PLDI '92, pages 32--43, New York, NY, USA, 1992. ACM. ISBN 0-89791-475-9. 10.1145/143095.143114. Google ScholarDigital Library
Intel Corporation. Using the RDTSC instruction for performance monitoring, 1997. URL http://www.intel.com.au/content/dam/www/public/us/en/documents/white-pa%pers/ia-32-ia-64-benchmark-code-execution-paper.pdf.Google Scholar
G. Kliot, E. Petrank, and B. Steensgaard. A lock-free, concurrent, and incremental stack scanning for garbage collectors. In Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE '09, pages 11--20, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-375-4. 10.1145/1508293.1508296. Google ScholarDigital Library
M. Kulkarni, M. Burtscher, C. Cascaval, and K. Pingali. Lonestar: A suite of parallel irregular programs. In Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on, pages 65--76, 2009. 10.1109/ISPASS.2009.4919639.Google ScholarCross Ref
V. Kumar, D. Frampton, S. M. Blackburn, D. Grove, and O. Tardieu. Work-stealing without the baggage. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '12, pages 297--314, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1561-6. 10.1145/2384616.2384639. Google ScholarDigital Library
D. Lea. A Java Fork/Join framework. In Proceedings of the ACM 2000 Conference on Java Grande, JAVA '00, pages 36--43, New York, NY, USA, 2000. ACM. ISBN 1-58113-288-3. 10.1145/337449.337465. Google ScholarDigital Library
R. Lublinerman, J. Zhao, Z. Budimlić, S. Chaudhuri, and V. Sarkar. Delegated isolation. In Proceedings of the 2011 ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '11, pages 885--902, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0940-0. 10.1145/2048066.2048133. Google ScholarDigital Library
R. Lüling and B. Monien. A dynamic distributed load balancing algorithm with provable good performance. In Proceedings of the Fifth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '93, pages 164--172, New York, NY, USA, 1993. ACM. ISBN 0-89791-599-2. 10.1145/165231.165252. Google ScholarDigital Library
M. Mitzenmacher. Analyses of load stealing models based on differential equations. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '98, pages 212--221, New York, NY, USA, 1998. ACM. ISBN 0-89791-989-0. 10.1145/277651.277687. Google ScholarDigital Library
E. Mohr, D. A. Kranz, and R. H. Halstead, Jr. Lazy task creation: A technique for increasing the granularity of parallel programs. In Proceedings of the 1990 ACM Conference on LISP and Functional Programming, LFP '90, pages 185--197, New York, NY, USA, 1990. ACM. ISBN 0-89791-368-X. 10.1145/91556.91631. Google ScholarDigital Library
S. Olivier, J. Huan, J. Liu, J. Prins, J. Dinan, P. Sadayappan, and C.-W. Tseng. UTS: an unbalanced tree search benchmark. In Proceedings of the 19th International Conference on Languages and Compilers for Parallel Computing, LCPC'06, pages 235--250, Berlin, Heidelberg, 2007. Springer-Verlag. ISBN 978-3-540-72520-6. URL http://dl.acm.org/citation.cfm?id=1757112.1757137. Google ScholarDigital Library
J. Reinders. Intel threading building blocks: outfitting C++ for multi-core processor parallelism. O'Reilly Media, Inc., 2010. Google ScholarDigital Library
L. Rudolph, M. Slivkin-Allalouf, and E. Upfal. A simple load balancing scheme for task allocation in parallel machines. In Proceedings of the Third Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '91, pages 237--245, New York, NY, USA, 1991. ACM. ISBN 0-89791-438-4. 10.1145/113379.113401. Google ScholarDigital Library
H. Saiki, Y. Konaka, T. Komiya, M. Yasugi, and T. Yuasa. Real-time GC in JeRTy#8482; VM using the return-barrier method. In Proceedings of the Eighth IEEE International Symposium on Object-Oriented Real-Time Distributed Computing, ISORC '05, pages 140--148, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2356-0. 10.1109/ISORC.2005.45. Google ScholarDigital Library
D. Sanchez, R. M. Yoo, and C. Kozyrakis. Flexible architectural support for fine-grain scheduling. In Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV, pages 311--322, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-839-1. 10.1145/1736020.1736055. Google ScholarDigital Library
V. Sundaresan, D. Maier, P. Ramarao, and M. Stoodley. Experiences with multi-threading and dynamic class loading in a Java Just-In-Time compiler. In Proceedings of the International Symposium on Code Generation and Optimization, CGO '06, pages 87--97, Washington, DC, USA, 2006. IEEE Computer Society. ISBN 0-7695-2499-0. 10.1109/CGO.2006.16. Google ScholarDigital Library
S. Umatani, M. Yasugi, T. Komiya, and T. Yuasa. Pursuing laziness for efficient implementation of modern multithreaded languages. In A. Veidenbaum, K. Joe, H. Amano, and H. Aiso, editors, High Performance Computing, volume 2858 of Lecture Notes in Computer Science, pages 174--188. Springer Berlin Heidelberg, 2003. ISBN 978-3-540-20359-9. 10.1007/978-3-540-39707-6_13.Google Scholar
E. Westbrook, J. Zhao, Z. Budimlić, and V. Sarkar. Practical permissions for race-free parallelism. In Proceedings of the 26th European Conference on Object-Oriented Programming, ECOOP'12, pages 614--639, Berlin, Heidelberg, 2012. Springer-Verlag. ISBN 978-3-642-31056-0. 10.1007/978-3-642-31057-7_27. Google ScholarDigital Library
T. Yuasa. Real-time garbage collection on general-purpose machines. J. Syst. Softw., 11 (3): 181--198, Mar. 1990. ISSN 0164-1212. 10.1016/0164-1212(90)90084-Y. Google ScholarDigital Library
T. Yuasa, Y. Nakagawa, T. Komiyay, and M. Yasugiy. Return barrier. In Proceedings of the International Lisp Conference, 2002.Google Scholar
J. Zhao, R. Lublinerman, Z. Budimlić, S. Chaudhuri, and V. Sarkar. Isolation for nested task parallelism. In Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '13, pages 571--588, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2374-1. 10.1145/2509136.2509534. Google ScholarDigital Library

Index Terms

Friendly barriers: efficient work-stealing with return barriers
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Recommendations

Work-stealing without the baggage
OOPSLA '12: Proceedings of the ACM international conference on Object oriented programming systems languages and applications

Work-stealing is a promising approach for effectively exploiting software parallelism on parallel hardware. A programmer who uses work-stealing explicitly identifies potential parallelism and the runtime then schedules work, keeping otherwise idle ...
Read More
Friendly barriers: efficient work-stealing with return barriers
VEE '14

This paper addresses the problem of efficiently supporting parallelism within a managed runtime. A popular approach for exploiting software parallelism on parallel hardware is task parallelism, where the programmer explicitly identifies potential ...
Read More
Work-stealing without the baggage
OOPSLA '12

Work-stealing is a promising approach for effectively exploiting software parallelism on parallel hardware. A programmer who uses work-stealing explicitly identifies potential parallelism and the runtime then schedules work, keeping otherwise idle ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
VEE '14: Proceedings of the 10th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
March 2014
236 pages
ISBN:9781450327640
DOI:10.1145/2576195
General Chair:
Martin Hirzel
IBM Research
,
Program Chairs:
Erez Petrank
Technion
,
Dan Tsafrir
Technion
ACM SIGPLAN Notices Volume 49, Issue 7
VEE '14
July 2014
222 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2674025
Editors:
Mark W. Bailey
Hamilton College, Clinton, NY
,
Rajeev Balasubramonian
University of Utah
,
Al Davis
University of Utah
,
Sarita Adve
University of Illinois at Urbana-Champ
Issue’s Table of Contents
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 March 2014
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
managed languages
scheduling
task parallelism
work-stealing
x10
Qualifiers
- research-article
Conference

Acceptance Rates
VEE '14 Paper Acceptance Rate18of56submissions,32%Overall Acceptance Rate80of235submissions,34%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 4
  Total Citations
  View Citations
- 259
  Total Downloads
- Downloads (Last 12 months)6
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Friendly barriers: efficient work-stealing with return barriers

VEE '14: Proceedings of the 10th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments

ABSTRACT

References

Cited By

Index Terms

Recommendations

Work-stealing without the baggage

Friendly barriers: efficient work-stealing with return barriers

Work-stealing without the baggage