Abstract
The emergence of the multi-core era has led to increased interest in designing effective yet practical parallel programming models. Models based on task graphs that operate on single-assignment data are attractive in several ways. Notably, they can support dynamic applications and precisely represent the available concurrency. However, for efficient execution, they also require nuanced algorithms for scheduling and memory management. In this article, we consider memory-efficient dynamic scheduling of task graphs. Specifically, we present a novel approach for dynamically recycling the memory locations assigned to data items as they are produced by tasks. We develop algorithms to identify memory-efficient store recycling functions by systematically evaluating the validity of a set of user-provided or automatically generated alternatives. Because recycling functions can be input data-dependent, we have also developed support for continued correct execution of a task graph in the presence of a potentially incorrect store recycling function.
Experimental evaluation demonstrates that this approach to automatic store recycling incurs little to no overheads, achieves memory usage comparable to the best manually derived solutions, often produces recycling functions valid across problem sizes and input parameters, and efficiently recovers from an incorrect choice of store recycling functions.
- Samah Abu-Mahmeed, Cheryl McCosh, Zoran Budimlić, Ken Kennedy, Kaushik Ravindran, Kevin Hogan, Paul Austin, Steve Rogers, and Jacob Kornerup. 2009. Scheduling tasks to maximize usage of aggregate variables in place. In CC. 204--219. Google ScholarDigital Library
- Kunal Agrawal, Charles E. Leiserson, and Jim Sukha. 2010. Executing task graphs using work-stealing. In Proceedings of the IEEE 24th International Symposium on Parallel 8 Distributed Processing (IPDPS). 1--12. Google ScholarCross Ref
- Arvind. 1981. Data flow languages and architecture. In Proceedings of the 8th Annual Symposium on Computer Architecture (ISCA’ 81). 1.Google Scholar
- Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier. 2011. StarPU: A unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience 23, 2 (2011), 187--198. Google ScholarDigital Library
- François Broquedis, Thierry Gautier, and Vincent Danjean. 2012. LIBKOMP, an efficient openMP runtime system for both fork-join and data flow paradigms. In Proceedings of the 8th International Conference on OpenMP in a Heterogeneous World (IWOMP’12). Springer-Verlag, Berlin, 102--115. DOI:http://dx.doi.org/10.1007/978-3-642-30961-8_8 Google ScholarDigital Library
- Zoran Budimlić, Michael Burke, Vincent Cavé, Kathleen Knobe, Geoff Lowney, and others. 2010. Concurrent collections. Scientific Programming 18, 3 (2010), 203--217. Google ScholarDigital Library
- Zoran Budimlic, Aparna M. Chandramowlishwaran, Kathleen Knobe, Geoff N. Lowney, Vivek Sarkar, and Leo Treggiari. 2008. Declarative aspects of memory management in the concurrent collections parallel programming model. In Proceedings of the Workshop on Declarative Aspects of Multicore Programming. 47--58. DOI:http://dx.doi.org/10.1145/1481839.1481846 Google ScholarDigital Library
- A. Bui, Kwang-Ting Cheng, J. Cong, L. Vese, Yi-Chu Wang, Bo Yuan, and Yi Zou. 2012. Platform characterization for domain-specific computing. In ASP-DAC. 94--99.Google Scholar
- Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC’09). IEEE, 44--54. Google ScholarDigital Library
- Jack B. Dennis. 1974. First version of a data flow procedure language. In Programming Symposium. Springer, 362--376. Google ScholarCross Ref
- Alejandro Duran, Eduard Ayguadé, Rosa M. Badia, Jesús Labarta, Luis Martinell, Xavier Martorell, and Judit Planas. 2011. OmpSs: A Proposal for Programming Heterogeneous Multi-core Architectures. Parallel Processing Letters 21, 2 (2011), 173--193. Google ScholarCross Ref
- Thierry Gautier, Joao V. F. Lima, Nicolas Maillard, and Bruno Raffin. 2013. XKaapi: A runtime system for data-flow task programming on heterogeneous architectures. In Proceedings of the 2013 IEEE 27th International Symposium on Parallel 8 Distributed Processing (IPDPS). IEEE, 1299--1308. Google ScholarDigital Library
- Léonard Gérard, Adrien Guatto, Cédric Pasteur, and Marc Pouzet. 2012. A modular memory optimization for synchronous data-flow languages: Application to arrays in a lustre compiler. SIGPLAN Not. 47, 5 (June 2012), 51--60. Google ScholarDigital Library
- Paul Hudak and Adrienne G. Bloss. 1985. The aggregate update problem in functional programming systems. In POPL. 300--314. Google ScholarDigital Library
- Prabhanjan Kambadur, Anshul Gupta, Torsten Hoefler, and Andrew Lumsdaine. 2009. Demand-driven execution of static directed acyclic graphs using task parallelism. In Proceedings of the 2009 International Conference on High Performance Computing (HiPC). IEEE, 284--293. Google ScholarCross Ref
- Mehmet Can Kurt, Sriram Krishnamoorthy, Kunal Agrawal, and Gagan Agrawal. 2014. Fault-tolerant dynamic task graph scheduling. In SC. 719--730. Google ScholarDigital Library
- Yu-Kwong Kwok and Ishfaq Ahmad. 1999. Static scheduling algorithms for allocating directed task graphs to multiprocessors. Comput. Surveys 31, 4 (Dec. 1999), 406--471. Google ScholarDigital Library
- Malcolm Yoke Hean Low, Weiguo Liu, and Bertil Schmidt. 2007. A parallel BSP algorithm for irregular dynamic programming. In APPT. 151--160.Google Scholar
- Nicholas D. Matsakis. 2012. Parallel closures: A new twist on an old idea. In HotPar. 5--5.Google Scholar
- Walid A. Najjar, Edward A. Lee, and Guang R. Gao. 1999. Advances in the dataflow computational model. Parallel Comput. 25, 1314 (1999), 1907--1929. Google ScholarDigital Library
- Antoniu Pop and Albert Cohen. 2013. OpenStream: Expressiveness and data-flow compilation of OpenMP streaming programs. ACM Transactions on Architecture and Code Optimization (TACO) 9, 4 (2013), 53.Google ScholarDigital Library
- Polyvios Pratikakis, Hans Vandierendonck, Spyros Lyberis, and Dimitrios S. Nikolopoulos. 2011. A programming model for deterministic task parallelism. In Proceedings of the 2011 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness. ACM, 7--12. Google ScholarDigital Library
- Veselin Raychev, Martin Vechev, and Manu Sridharan. 2013. Effective race detection for event-driven programs. In OOPSLA. 151--166. DOI:http://dx.doi.org/10.1145/2509136.2509538 Google ScholarDigital Library
- Dragos Sbirlea, Zoran Budimlic, and Vivek Sarkar. 2014. Bounded memory scheduling of dynamic task graphs. In PACT. 343--356. DOI:http://dx.doi.org/10.1145/2628071.2628090 Google ScholarDigital Library
- Dragos Sbirlea, Kathleen Knobe, and Vivek Sarkar. 2012. Folding of tagged single assignment values for memory-efficient parallelism. In Euro-Par. 601--613. Google ScholarDigital Library
- Peter Schnorf, Mahadevan Ganapathi, and John L. Hennessy. 1993. Compile-time copy elimination. Software: Practice and Experience 23, 11 (1993), 1175--1200. Google ScholarDigital Library
- Sagnak Tasirlar and Vivek Sarkar. 2011. Data-driven tasks and their implementation. In ICPP. 652--661. DOI:http://dx.doi.org/10.1109/ICPP.2011.87 Google ScholarDigital Library
- Asim YarKhan, Jakub Kurzak, and Jack Dongarra. 2011. Quark users guide: Queueing and runtime for kernels. University of Tennessee Innovative Computing Laboratory Technical Report ICL-UT-11-02.Google Scholar
Index Terms
- User-Assisted Store Recycling for Dynamic Task Graph Schedulers
Recommendations
User-assisted storage reuse determination for dynamic task graphs
PPoPP '16: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel ProgrammingModels based on task graphs that operate on single-assignment data are attractive in several ways, but also require nuanced algorithms for scheduling and memory management for efficient execution. In this paper, we consider memory-efficient dynamic ...
User-assisted storage reuse determination for dynamic task graphs
PPoPP '16Models based on task graphs that operate on single-assignment data are attractive in several ways, but also require nuanced algorithms for scheduling and memory management for efficient execution. In this paper, we consider memory-efficient dynamic ...
Enabling Hybrid PCM Memory System with Inherent Memory Management
RACS '16: Proceedings of the International Conference on Research in Adaptive and Convergent SystemsReplacing the traditional volatile main memory, e.g., DRAM, with a non-volatile phase change memory (PCM) has become a possible solution to reduce the energy consumption of computing systems. To further reduce the bit cost of PCM, the development trend ...
Comments