skip to main content
10.1145/2576195.2576207acmconferencesArticle/Chapter ViewAbstractPublication PagesveeConference Proceedingsconference-collections
research-article

Friendly barriers: efficient work-stealing with return barriers

Published:01 March 2014Publication History

ABSTRACT

This paper addresses the problem of efficiently supporting parallelism within a managed runtime. A popular approach for exploiting software parallelism on parallel hardware is task parallelism, where the programmer explicitly identifies potential parallelism and the runtime then schedules the work. Work-stealing is a promising scheduling strategy that a runtime may use to keep otherwise idle hardware busy while relieving overloaded hardware of its burden. However, work-stealing comes with substantial overheads. Recent work identified sequential overheads of work-stealing, those that occur even when no stealing takes place, as a significant source of overhead. That work was able to reduce sequential overheads to just 15%.

In this work, we turn to dynamic overheads, those that occur each time a steal takes place. We show that the dynamic overhead is dominated by introspection of the victim's stack when a steal takes place. We exploit the idea of a low overhead return barrier to reduce the dynamic overhead by approximately half, resulting in total performance improvements of as much as 20%. Because, unlike prior work, we attack the overheads directly due to stealing and therefore attack the overheads that grow as parallelism grows, we improve the scalability of work-stealing applications. This result is complementary to recent work addressing the sequential overheads of work-stealing. This work therefore substantially relieves work-stealing of the increasing pressure due to increasing intra-node hardware parallelism.

References

  1. U. A. Acar, A. Chargueraud, and M. Rainey. Scheduling parallel programs by work stealing with private deques. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, pages 219--228, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-1922-5. 10.1145/2442516.2442538. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. B. Alpern, C. Attanasio, J. Barton, M. Burke, P. Cheng, J. Choi, A. Cocchi, S. Fink, D. Grove, M. Hind, et al. The Jalape no virtual machine. IBM Systems Journal, 39 (1): 211--238, 2010. ISSN 0018-8670. 10.1147/sj.391.0211. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. M. Arnold, S. Fink, D. Grove, M. Hind, and P. F. Sweeney. Adaptive optimization in the Jalape no JVM. In Proceedings of the 15th ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '00, pages 47--65, New York, NY, USA, 2000. ACM. ISBN 1-58113-200-X. 10.1145/353171.353175. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '98, pages 119--129, New York, NY, USA, 1998. ACM. ISBN 0-89791-989-0. 10.1145/277651.277678. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. P. Berenbrink, T. Friedetzky, and L. A. Goldberg. The natural work-stealing algorithm is stable. SIAM J. Comput., 32 (5): 1260--1279, May 2003. ISSN 0097-5397. 10.1137/S0097539701399551. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. J. ACM, 46 (5): 720--748, Sept. 1999. ISSN 0004-5411. 10.1145/324133.324234. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Cavé, J. Zhao, J. Shirako, and V. Sarkar. Habanero-Java: the new adventures of old X10. In Proceedings of the 9th International Conference on Principles and Practice of Programming in Java, PPPJ '11, pages 51--61, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0935-6. 10.1145/2093157.2093165. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: An object-oriented approach to non-uniform cluster computing. In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '05, pages 519--538, New York, NY, USA, 2005. ACM. ISBN 1-59593-031-0. 10.1145/1094811.1094852. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. G. Cong, S. Kodali, S. Krishnamoorthy, D. Lea, V. Saraswat, and T. Wen. Solving large, irregular graph problems using adaptive work-stealing. In Proceedings of the 2008 37th International Conference on Parallel Processing, ICPP '08, pages 536--545, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-0-7695-3374-2. 10.1109/ICPP.2008.88. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. J. Dinan, D. B. Larkins, P. Sadayappan, S. Krishnamoorthy, and J. Nieplocha. Scalable work stealing. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pages 53:1--53:11, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-744-8. 10.1145/1654059.1654113. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. S. J. Fink and F. Qian. Design, implementation and evaluation of adaptive recompilation with on-stack replacement. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, CGO '03, pages 241--252, Washington, DC, USA, 2003. IEEE Computer Society. ISBN 0-7695-1913-X. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. M. Frigo, H. Prokop, M. Frigo, C. Leiserson, H. Prokop, S. Ramachandran, D. Dailey, C. Leiserson, I. Lyubashevskiy, N. Kushman, et al. The Cilk project. Algorithms, 1998.Google ScholarGoogle Scholar
  13. Y. Guo, R. Barik, R. Raman, and V. Sarkar. Work-first and help-first scheduling policies for async-finish task parallelism. In Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, IPDPS '09, pages 1--12, Washington, DC, USA, 2009. IEEE Computer Society. ISBN 978-1-4244-3751-1. 10.1109/IPDPS.2009.5161079. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Y. Guo, J. Zhao, V. Cave, and V. Sarkar. SLAW: A scalable locality-aware adaptive work-stealing scheduler for multi-core systems. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '10, pages 341--342, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-877-3. 10.1145/1693453.1693504. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. D. Hendler and N. Shavit. Non-blocking steal-half work queues. In Proceedings of the Twenty-first Annual Symposium on Principles of Distributed Computing, PODC '02, pages 280--289, New York, NY, USA, 2002. ACM. ISBN 1-58113-485-1. 10.1145/571825.571876. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. R. Hoffmann and T. Rauber. Adaptive task pools: efficiently balancing large number of tasks on shared-address spaces. International Journal of Parallel Programming, 39 (5): 553--581, 2011.Google ScholarGoogle ScholarCross RefCross Ref
  17. U. Hölzle, C. Chambers, and D. Ungar. Debugging optimized code with dynamic deoptimization. In Proceedings of the ACM SIGPLAN 1992 Conference on Programming Language Design and Implementation, PLDI '92, pages 32--43, New York, NY, USA, 1992. ACM. ISBN 0-89791-475-9. 10.1145/143095.143114. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Intel Corporation. Using the RDTSC instruction for performance monitoring, 1997. URL http://www.intel.com.au/content/dam/www/public/us/en/documents/white-pa%pers/ia-32-ia-64-benchmark-code-execution-paper.pdf.Google ScholarGoogle Scholar
  19. G. Kliot, E. Petrank, and B. Steensgaard. A lock-free, concurrent, and incremental stack scanning for garbage collectors. In Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE '09, pages 11--20, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-375-4. 10.1145/1508293.1508296. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. M. Kulkarni, M. Burtscher, C. Cascaval, and K. Pingali. Lonestar: A suite of parallel irregular programs. In Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on, pages 65--76, 2009. 10.1109/ISPASS.2009.4919639.Google ScholarGoogle ScholarCross RefCross Ref
  21. V. Kumar, D. Frampton, S. M. Blackburn, D. Grove, and O. Tardieu. Work-stealing without the baggage. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '12, pages 297--314, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1561-6. 10.1145/2384616.2384639. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. D. Lea. A Java Fork/Join framework. In Proceedings of the ACM 2000 Conference on Java Grande, JAVA '00, pages 36--43, New York, NY, USA, 2000. ACM. ISBN 1-58113-288-3. 10.1145/337449.337465. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. R. Lublinerman, J. Zhao, Z. Budimlić, S. Chaudhuri, and V. Sarkar. Delegated isolation. In Proceedings of the 2011 ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '11, pages 885--902, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0940-0. 10.1145/2048066.2048133. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. R. Lüling and B. Monien. A dynamic distributed load balancing algorithm with provable good performance. In Proceedings of the Fifth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '93, pages 164--172, New York, NY, USA, 1993. ACM. ISBN 0-89791-599-2. 10.1145/165231.165252. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. M. Mitzenmacher. Analyses of load stealing models based on differential equations. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '98, pages 212--221, New York, NY, USA, 1998. ACM. ISBN 0-89791-989-0. 10.1145/277651.277687. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. E. Mohr, D. A. Kranz, and R. H. Halstead, Jr. Lazy task creation: A technique for increasing the granularity of parallel programs. In Proceedings of the 1990 ACM Conference on LISP and Functional Programming, LFP '90, pages 185--197, New York, NY, USA, 1990. ACM. ISBN 0-89791-368-X. 10.1145/91556.91631. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. S. Olivier, J. Huan, J. Liu, J. Prins, J. Dinan, P. Sadayappan, and C.-W. Tseng. UTS: an unbalanced tree search benchmark. In Proceedings of the 19th International Conference on Languages and Compilers for Parallel Computing, LCPC'06, pages 235--250, Berlin, Heidelberg, 2007. Springer-Verlag. ISBN 978-3-540-72520-6. URL http://dl.acm.org/citation.cfm?id=1757112.1757137. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. J. Reinders. Intel threading building blocks: outfitting C++ for multi-core processor parallelism. O'Reilly Media, Inc., 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. L. Rudolph, M. Slivkin-Allalouf, and E. Upfal. A simple load balancing scheme for task allocation in parallel machines. In Proceedings of the Third Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '91, pages 237--245, New York, NY, USA, 1991. ACM. ISBN 0-89791-438-4. 10.1145/113379.113401. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. H. Saiki, Y. Konaka, T. Komiya, M. Yasugi, and T. Yuasa. Real-time GC in JeRTy#8482; VM using the return-barrier method. In Proceedings of the Eighth IEEE International Symposium on Object-Oriented Real-Time Distributed Computing, ISORC '05, pages 140--148, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2356-0. 10.1109/ISORC.2005.45. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. D. Sanchez, R. M. Yoo, and C. Kozyrakis. Flexible architectural support for fine-grain scheduling. In Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV, pages 311--322, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-839-1. 10.1145/1736020.1736055. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. V. Sundaresan, D. Maier, P. Ramarao, and M. Stoodley. Experiences with multi-threading and dynamic class loading in a Java Just-In-Time compiler. In Proceedings of the International Symposium on Code Generation and Optimization, CGO '06, pages 87--97, Washington, DC, USA, 2006. IEEE Computer Society. ISBN 0-7695-2499-0. 10.1109/CGO.2006.16. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. S. Umatani, M. Yasugi, T. Komiya, and T. Yuasa. Pursuing laziness for efficient implementation of modern multithreaded languages. In A. Veidenbaum, K. Joe, H. Amano, and H. Aiso, editors, High Performance Computing, volume 2858 of Lecture Notes in Computer Science, pages 174--188. Springer Berlin Heidelberg, 2003. ISBN 978-3-540-20359-9. 10.1007/978-3-540-39707-6_13.Google ScholarGoogle Scholar
  34. E. Westbrook, J. Zhao, Z. Budimlić, and V. Sarkar. Practical permissions for race-free parallelism. In Proceedings of the 26th European Conference on Object-Oriented Programming, ECOOP'12, pages 614--639, Berlin, Heidelberg, 2012. Springer-Verlag. ISBN 978-3-642-31056-0. 10.1007/978-3-642-31057-7_27. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. T. Yuasa. Real-time garbage collection on general-purpose machines. J. Syst. Softw., 11 (3): 181--198, Mar. 1990. ISSN 0164-1212. 10.1016/0164-1212(90)90084-Y. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. T. Yuasa, Y. Nakagawa, T. Komiyay, and M. Yasugiy. Return barrier. In Proceedings of the International Lisp Conference, 2002.Google ScholarGoogle Scholar
  37. J. Zhao, R. Lublinerman, Z. Budimlić, S. Chaudhuri, and V. Sarkar. Isolation for nested task parallelism. In Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '13, pages 571--588, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2374-1. 10.1145/2509136.2509534. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Friendly barriers: efficient work-stealing with return barriers

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      VEE '14: Proceedings of the 10th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
      March 2014
      236 pages
      ISBN:9781450327640
      DOI:10.1145/2576195

      Copyright © 2014 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 1 March 2014

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      VEE '14 Paper Acceptance Rate18of56submissions,32%Overall Acceptance Rate80of235submissions,34%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader