ABSTRACT
This paper addresses the problem of efficiently supporting parallelism within a managed runtime. A popular approach for exploiting software parallelism on parallel hardware is task parallelism, where the programmer explicitly identifies potential parallelism and the runtime then schedules the work. Work-stealing is a promising scheduling strategy that a runtime may use to keep otherwise idle hardware busy while relieving overloaded hardware of its burden. However, work-stealing comes with substantial overheads. Recent work identified sequential overheads of work-stealing, those that occur even when no stealing takes place, as a significant source of overhead. That work was able to reduce sequential overheads to just 15%.
In this work, we turn to dynamic overheads, those that occur each time a steal takes place. We show that the dynamic overhead is dominated by introspection of the victim's stack when a steal takes place. We exploit the idea of a low overhead return barrier to reduce the dynamic overhead by approximately half, resulting in total performance improvements of as much as 20%. Because, unlike prior work, we attack the overheads directly due to stealing and therefore attack the overheads that grow as parallelism grows, we improve the scalability of work-stealing applications. This result is complementary to recent work addressing the sequential overheads of work-stealing. This work therefore substantially relieves work-stealing of the increasing pressure due to increasing intra-node hardware parallelism.
- U. A. Acar, A. Chargueraud, and M. Rainey. Scheduling parallel programs by work stealing with private deques. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, pages 219--228, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-1922-5. 10.1145/2442516.2442538. Google ScholarDigital Library
- B. Alpern, C. Attanasio, J. Barton, M. Burke, P. Cheng, J. Choi, A. Cocchi, S. Fink, D. Grove, M. Hind, et al. The Jalape no virtual machine. IBM Systems Journal, 39 (1): 211--238, 2010. ISSN 0018-8670. 10.1147/sj.391.0211. Google ScholarDigital Library
- M. Arnold, S. Fink, D. Grove, M. Hind, and P. F. Sweeney. Adaptive optimization in the Jalape no JVM. In Proceedings of the 15th ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '00, pages 47--65, New York, NY, USA, 2000. ACM. ISBN 1-58113-200-X. 10.1145/353171.353175. Google ScholarDigital Library
- N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '98, pages 119--129, New York, NY, USA, 1998. ACM. ISBN 0-89791-989-0. 10.1145/277651.277678. Google ScholarDigital Library
- P. Berenbrink, T. Friedetzky, and L. A. Goldberg. The natural work-stealing algorithm is stable. SIAM J. Comput., 32 (5): 1260--1279, May 2003. ISSN 0097-5397. 10.1137/S0097539701399551. Google ScholarDigital Library
- R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. J. ACM, 46 (5): 720--748, Sept. 1999. ISSN 0004-5411. 10.1145/324133.324234. Google ScholarDigital Library
- Cavé, J. Zhao, J. Shirako, and V. Sarkar. Habanero-Java: the new adventures of old X10. In Proceedings of the 9th International Conference on Principles and Practice of Programming in Java, PPPJ '11, pages 51--61, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0935-6. 10.1145/2093157.2093165. Google ScholarDigital Library
- P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: An object-oriented approach to non-uniform cluster computing. In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '05, pages 519--538, New York, NY, USA, 2005. ACM. ISBN 1-59593-031-0. 10.1145/1094811.1094852. Google ScholarDigital Library
- G. Cong, S. Kodali, S. Krishnamoorthy, D. Lea, V. Saraswat, and T. Wen. Solving large, irregular graph problems using adaptive work-stealing. In Proceedings of the 2008 37th International Conference on Parallel Processing, ICPP '08, pages 536--545, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-0-7695-3374-2. 10.1109/ICPP.2008.88. Google ScholarDigital Library
- J. Dinan, D. B. Larkins, P. Sadayappan, S. Krishnamoorthy, and J. Nieplocha. Scalable work stealing. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pages 53:1--53:11, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-744-8. 10.1145/1654059.1654113. Google ScholarDigital Library
- S. J. Fink and F. Qian. Design, implementation and evaluation of adaptive recompilation with on-stack replacement. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, CGO '03, pages 241--252, Washington, DC, USA, 2003. IEEE Computer Society. ISBN 0-7695-1913-X. Google ScholarDigital Library
- M. Frigo, H. Prokop, M. Frigo, C. Leiserson, H. Prokop, S. Ramachandran, D. Dailey, C. Leiserson, I. Lyubashevskiy, N. Kushman, et al. The Cilk project. Algorithms, 1998.Google Scholar
- Y. Guo, R. Barik, R. Raman, and V. Sarkar. Work-first and help-first scheduling policies for async-finish task parallelism. In Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, IPDPS '09, pages 1--12, Washington, DC, USA, 2009. IEEE Computer Society. ISBN 978-1-4244-3751-1. 10.1109/IPDPS.2009.5161079. Google ScholarDigital Library
- Y. Guo, J. Zhao, V. Cave, and V. Sarkar. SLAW: A scalable locality-aware adaptive work-stealing scheduler for multi-core systems. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '10, pages 341--342, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-877-3. 10.1145/1693453.1693504. Google ScholarDigital Library
- D. Hendler and N. Shavit. Non-blocking steal-half work queues. In Proceedings of the Twenty-first Annual Symposium on Principles of Distributed Computing, PODC '02, pages 280--289, New York, NY, USA, 2002. ACM. ISBN 1-58113-485-1. 10.1145/571825.571876. Google ScholarDigital Library
- R. Hoffmann and T. Rauber. Adaptive task pools: efficiently balancing large number of tasks on shared-address spaces. International Journal of Parallel Programming, 39 (5): 553--581, 2011.Google ScholarCross Ref
- U. Hölzle, C. Chambers, and D. Ungar. Debugging optimized code with dynamic deoptimization. In Proceedings of the ACM SIGPLAN 1992 Conference on Programming Language Design and Implementation, PLDI '92, pages 32--43, New York, NY, USA, 1992. ACM. ISBN 0-89791-475-9. 10.1145/143095.143114. Google ScholarDigital Library
- Intel Corporation. Using the RDTSC instruction for performance monitoring, 1997. URL http://www.intel.com.au/content/dam/www/public/us/en/documents/white-pa%pers/ia-32-ia-64-benchmark-code-execution-paper.pdf.Google Scholar
- G. Kliot, E. Petrank, and B. Steensgaard. A lock-free, concurrent, and incremental stack scanning for garbage collectors. In Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE '09, pages 11--20, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-375-4. 10.1145/1508293.1508296. Google ScholarDigital Library
- M. Kulkarni, M. Burtscher, C. Cascaval, and K. Pingali. Lonestar: A suite of parallel irregular programs. In Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on, pages 65--76, 2009. 10.1109/ISPASS.2009.4919639.Google ScholarCross Ref
- V. Kumar, D. Frampton, S. M. Blackburn, D. Grove, and O. Tardieu. Work-stealing without the baggage. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '12, pages 297--314, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1561-6. 10.1145/2384616.2384639. Google ScholarDigital Library
- D. Lea. A Java Fork/Join framework. In Proceedings of the ACM 2000 Conference on Java Grande, JAVA '00, pages 36--43, New York, NY, USA, 2000. ACM. ISBN 1-58113-288-3. 10.1145/337449.337465. Google ScholarDigital Library
- R. Lublinerman, J. Zhao, Z. Budimlić, S. Chaudhuri, and V. Sarkar. Delegated isolation. In Proceedings of the 2011 ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '11, pages 885--902, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0940-0. 10.1145/2048066.2048133. Google ScholarDigital Library
- R. Lüling and B. Monien. A dynamic distributed load balancing algorithm with provable good performance. In Proceedings of the Fifth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '93, pages 164--172, New York, NY, USA, 1993. ACM. ISBN 0-89791-599-2. 10.1145/165231.165252. Google ScholarDigital Library
- M. Mitzenmacher. Analyses of load stealing models based on differential equations. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '98, pages 212--221, New York, NY, USA, 1998. ACM. ISBN 0-89791-989-0. 10.1145/277651.277687. Google ScholarDigital Library
- E. Mohr, D. A. Kranz, and R. H. Halstead, Jr. Lazy task creation: A technique for increasing the granularity of parallel programs. In Proceedings of the 1990 ACM Conference on LISP and Functional Programming, LFP '90, pages 185--197, New York, NY, USA, 1990. ACM. ISBN 0-89791-368-X. 10.1145/91556.91631. Google ScholarDigital Library
- S. Olivier, J. Huan, J. Liu, J. Prins, J. Dinan, P. Sadayappan, and C.-W. Tseng. UTS: an unbalanced tree search benchmark. In Proceedings of the 19th International Conference on Languages and Compilers for Parallel Computing, LCPC'06, pages 235--250, Berlin, Heidelberg, 2007. Springer-Verlag. ISBN 978-3-540-72520-6. URL http://dl.acm.org/citation.cfm?id=1757112.1757137. Google ScholarDigital Library
- J. Reinders. Intel threading building blocks: outfitting C++ for multi-core processor parallelism. O'Reilly Media, Inc., 2010. Google ScholarDigital Library
- L. Rudolph, M. Slivkin-Allalouf, and E. Upfal. A simple load balancing scheme for task allocation in parallel machines. In Proceedings of the Third Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '91, pages 237--245, New York, NY, USA, 1991. ACM. ISBN 0-89791-438-4. 10.1145/113379.113401. Google ScholarDigital Library
- H. Saiki, Y. Konaka, T. Komiya, M. Yasugi, and T. Yuasa. Real-time GC in JeRTy#8482; VM using the return-barrier method. In Proceedings of the Eighth IEEE International Symposium on Object-Oriented Real-Time Distributed Computing, ISORC '05, pages 140--148, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2356-0. 10.1109/ISORC.2005.45. Google ScholarDigital Library
- D. Sanchez, R. M. Yoo, and C. Kozyrakis. Flexible architectural support for fine-grain scheduling. In Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV, pages 311--322, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-839-1. 10.1145/1736020.1736055. Google ScholarDigital Library
- V. Sundaresan, D. Maier, P. Ramarao, and M. Stoodley. Experiences with multi-threading and dynamic class loading in a Java Just-In-Time compiler. In Proceedings of the International Symposium on Code Generation and Optimization, CGO '06, pages 87--97, Washington, DC, USA, 2006. IEEE Computer Society. ISBN 0-7695-2499-0. 10.1109/CGO.2006.16. Google ScholarDigital Library
- S. Umatani, M. Yasugi, T. Komiya, and T. Yuasa. Pursuing laziness for efficient implementation of modern multithreaded languages. In A. Veidenbaum, K. Joe, H. Amano, and H. Aiso, editors, High Performance Computing, volume 2858 of Lecture Notes in Computer Science, pages 174--188. Springer Berlin Heidelberg, 2003. ISBN 978-3-540-20359-9. 10.1007/978-3-540-39707-6_13.Google Scholar
- E. Westbrook, J. Zhao, Z. Budimlić, and V. Sarkar. Practical permissions for race-free parallelism. In Proceedings of the 26th European Conference on Object-Oriented Programming, ECOOP'12, pages 614--639, Berlin, Heidelberg, 2012. Springer-Verlag. ISBN 978-3-642-31056-0. 10.1007/978-3-642-31057-7_27. Google ScholarDigital Library
- T. Yuasa. Real-time garbage collection on general-purpose machines. J. Syst. Softw., 11 (3): 181--198, Mar. 1990. ISSN 0164-1212. 10.1016/0164-1212(90)90084-Y. Google ScholarDigital Library
- T. Yuasa, Y. Nakagawa, T. Komiyay, and M. Yasugiy. Return barrier. In Proceedings of the International Lisp Conference, 2002.Google Scholar
- J. Zhao, R. Lublinerman, Z. Budimlić, S. Chaudhuri, and V. Sarkar. Isolation for nested task parallelism. In Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '13, pages 571--588, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2374-1. 10.1145/2509136.2509534. Google ScholarDigital Library
Index Terms
- Friendly barriers: efficient work-stealing with return barriers
Recommendations
Work-stealing without the baggage
OOPSLA '12: Proceedings of the ACM international conference on Object oriented programming systems languages and applicationsWork-stealing is a promising approach for effectively exploiting software parallelism on parallel hardware. A programmer who uses work-stealing explicitly identifies potential parallelism and the runtime then schedules work, keeping otherwise idle ...
Friendly barriers: efficient work-stealing with return barriers
VEE '14This paper addresses the problem of efficiently supporting parallelism within a managed runtime. A popular approach for exploiting software parallelism on parallel hardware is task parallelism, where the programmer explicitly identifies potential ...
Work-stealing without the baggage
OOPSLA '12Work-stealing is a promising approach for effectively exploiting software parallelism on parallel hardware. A programmer who uses work-stealing explicitly identifies potential parallelism and the runtime then schedules work, keeping otherwise idle ...
Comments