Symbolic Multi-Level Loop Mapping of Loop Programs for Massively Parallel Processor Arrays

Published: 7 December 2017

Abstract

Today’s MPSoCs (multiprocessor systems-on-chip) feature massively parallel processor array accelerators that can achieve high computational efficiency by exploiting multiple levels of parallelism and different memory hierarchies. Such processor arrays are ideal targets for the acceleration of nested loop programs due to their regular and massively parallel nature. However, existing loop parallelization techniques are often unable to exploit multiple levels of parallelism and are either I/O- or memory-bound. Furthermore, if the number of available processing elements becomes known only at runtime (as in adaptive systems), static approaches fail. In this article, we address these problems by proposing a hybrid compile-time/runtime multi-level symbolic parallelization technique that is able to (a) exploit multiple levels of parallelism, (b) exploit different memory hierarchies, and (c) match the I/O and memory capabilities of the target architecture in scenarios where the number of available processing elements is known only at runtime. Our proposed technique consists of two compile-time transformations: (a) symbolic hierarchical tiling followed by (b) symbolic multi-level scheduling. The tiling levels scheduled in parallel exploit different levels of parallelism, whereas the sequentially scheduled levels exploit different memory hierarchies. Furthermore, by tuning the tile sizes on the individual levels, a tradeoff between the necessary I/O bandwidth and memory becomes possible, which facilitates obeying resource constraints. The resulting schedules are symbolic with respect to the problem size and the tile sizes; thus, the number of processing elements to map onto need not be known at compile time. At runtime, once the number of available processors becomes known, a simple prologue chooses a schedule that is feasible with respect to the I/O and memory constraints and latency-optimal for the chosen tile size.
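The hierarchical tiling step can be illustrated on a one-dimensional loop. The sketch below is a simplified Python model, not the article's actual transformation: it enumerates an iteration space through two tiling levels whose sizes `p1` and `p2` remain parameters, mirroring symbolic tile sizes that need not be fixed at compile time.

```python
def hierarchically_tile(N, p1, p2):
    """Enumerate iterations 0..N-1 through a two-level tiling.

    Level 1: tiles of size p1 (e.g., mapped in parallel onto PEs).
    Level 2: groups of p2 level-1 tiles (e.g., executed sequentially,
             matching one level of the memory hierarchy).
    p1 and p2 stay parametric, mirroring symbolic tile sizes.
    """
    order = []
    block = p1 * p2                              # iterations per level-2 tile
    for k in range((N + block - 1) // block):    # level-2 (sequential) tiles
        for j in range(p2):                      # level-1 tiles inside
            for i in range(p1):                  # iterations inside a tile
                idx = k * block + j * p1 + i
                if idx < N:                      # guard the partial last tile
                    order.append(idx)
    return order
```

Intuitively, choosing `p1` trades parallelism (the number of processing elements used) against per-PE work, while `p2` tunes how much data is staged per sequential step, which is the I/O-versus-memory tradeoff described above.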
In summary, our approach determines the set of feasible, latency-optimal symbolic loop schedule candidates at compile time, from which one is dynamically selected at runtime. It exploits multiple levels of parallelism, is independent of the problem size of the loop nest, and thereby avoids any expensive recompilation at runtime. This is particularly important for low-cost, memory-scarce embedded MPSoC platforms that cannot afford to host a just-in-time compiler.
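The runtime prologue can be pictured as a small selection routine. The following Python sketch is purely illustrative; the candidate names and cost functions are made up, not the paper's latency formulas. Each precomputed symbolic candidate carries expressions (here, closures) that are evaluated once the number of processing elements is known, and a latency-optimal feasible one is picked.

```python
def select_schedule(candidates, num_pes, io_bw, mem_cap):
    """Runtime prologue: among precomputed symbolic schedule candidates,
    return the name of a latency-optimal one that respects the I/O
    bandwidth and memory budget of the target (illustrative model)."""
    feasible = []
    for c in candidates:
        # Evaluate the symbolic expressions now that num_pes is known.
        if c["io"](num_pes) <= io_bw and c["mem"](num_pes) <= mem_cap:
            feasible.append((c["latency"](num_pes), c["name"]))
    if not feasible:
        raise RuntimeError("no feasible schedule for the given resources")
    return min(feasible)[1]  # smallest latency wins
```

Because only this selection, and not compilation, happens at runtime, the prologue remains cheap enough for memory-scarce embedded targets.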



        • Published in

          ACM Transactions on Embedded Computing Systems, Volume 17, Issue 2
          Special Issue on MEMOCODE 2015 and Regular Papers (Diamonds)
          March 2018, 640 pages
          ISSN: 1539-9087
          EISSN: 1558-3465
          DOI: 10.1145/3160927

          Copyright © 2017 ACM


          Publisher

          Association for Computing Machinery, New York, NY, United States

          Publication History

          • Published: 7 December 2017
          • Accepted: 1 April 2017
          • Revised: 1 September 2016
          • Received: 1 January 2016


          Qualifiers

          • research-article
          • Research
          • Refereed
