Abstract
Today’s multiprocessor systems-on-chip (MPSoCs) have given rise to massively parallel processor array accelerators that can achieve high computational efficiency by exploiting multiple levels of parallelism and different memory hierarchies. Such parallel processor arrays are ideal targets, particularly for the acceleration of nested loop programs, thanks to their regular and massively parallel structure. However, existing loop parallelization techniques are often unable to exploit multiple levels of parallelism and are either I/O- or memory-bound. Furthermore, if the number of available processing elements becomes known only at runtime, as in adaptive systems, static approaches fail. In this article, we address these problems by proposing a hybrid compile-time/runtime multi-level symbolic parallelization technique that is able to (a) exploit multiple levels of parallelism, (b) exploit different memory hierarchies, and (c) match the I/O or memory capabilities of the target architecture in scenarios where the number of available processing elements becomes known only at runtime. Our proposed technique consists of two compile-time transformations: (a) symbolic hierarchical tiling followed by (b) symbolic multi-level scheduling. The tiling levels that are scheduled in parallel exploit different levels of parallelism, whereas the sequentially scheduled level exploits the memory hierarchy. Furthermore, by tuning the tile sizes on the individual levels, a tradeoff between the required I/O bandwidth and memory becomes possible, which facilitates satisfying resource constraints. The resulting schedules are symbolic in both the problem size and the tile sizes; thus, the number of processing elements to map onto need not be known at compile time. At runtime, once the number of available processors becomes known, a simple prologue selects a schedule that is feasible with respect to I/O and memory constraints and latency-optimal for the chosen tile size.
In summary, our approach determines the set of feasible, latency-optimal symbolic loop schedule candidates at compile time, from which one is selected dynamically at runtime. This approach exploits multiple levels of parallelism, is independent of the problem size of the loop nest, and thereby avoids any expensive recompilation at runtime. This is particularly important for low-cost, memory-scarce embedded MPSoC platforms that cannot afford to host a just-in-time compiler.