Abstract
Today’s multiprocessor systems-on-chip (MPSoCs) have given rise to massively parallel processor array accelerators that can achieve high computational efficiency by exploiting multiple levels of parallelism and different memory hierarchies. Such parallel processor arrays are ideal targets, particularly for the acceleration of nested loop programs, thanks to their regular and massively parallel structure. However, existing loop parallelization techniques are often unable to exploit multiple levels of parallelism and are either I/O- or memory-bound. Furthermore, if the number of available processing elements becomes known only at runtime, as in adaptive systems, static approaches fail. In this article, we address these problems by proposing a hybrid compile-time/runtime multi-level symbolic parallelization technique that is able to (a) exploit multiple levels of parallelism, (b) exploit different memory hierarchies, and (c) match the I/O or memory capabilities of the target architecture in scenarios where the number of available processing elements becomes known only at runtime. Our proposed technique consists of two compile-time transformations: (a) symbolic hierarchical tiling followed by (b) symbolic multi-level scheduling. The tiling levels that are scheduled in parallel exploit different levels of parallelism, whereas the sequentially scheduled level exploits the memory hierarchy. Furthermore, by tuning the tile sizes on the individual levels, a tradeoff between the required I/O bandwidth and memory becomes possible, which facilitates satisfying resource constraints. The resulting schedules are symbolic in both the problem size and the tile sizes; thus, the number of processing elements to map onto need not be known at compile time. At runtime, once the number of available processors becomes known, a simple prologue selects a schedule that is feasible with respect to I/O and memory constraints and latency-optimal for the chosen tile size.
In summary, our approach determines the set of feasible, latency-optimal symbolic loop schedule candidates at compile time, from which one is selected dynamically at runtime. This approach exploits multiple levels of parallelism, is independent of the problem size of the loop nest, and thereby avoids any expensive recompilation at runtime. This is particularly important for low-cost, memory-scarce embedded MPSoC platforms that cannot afford to host a just-in-time compiler.