2018 | Book

Symbolic Parallelization of Nested Loop Programs

Authors: Alexandru-Petru Tanase, Frank Hannig, Jürgen Teich

Publisher: Springer International Publishing

About this book

This book introduces new compilation techniques, based on the polyhedron model, for the resource-adaptive parallel execution of loop programs on massively parallel processor arrays. The authors show how to compute optimal symbolic assignments and parallel schedules of loop iterations at compile time, for cases where the number of available cores becomes known only at runtime. The compile/runtime symbolic parallelization approach the authors describe significantly reduces the runtime overhead compared to dynamic or just-in-time compilation. The new, on-demand fault-tolerant loop processing approach described in this book protects loop nests against soft errors during parallel execution.

Table of Contents

Frontmatter
Chapter 1. Introduction
Abstract
In 1965, Gordon Moore predicted that the number of transistors per chip would double every two years and that chips would eventually be so small that they could be embedded in homes, cars, and mobile devices. As this prophecy came true, the efficient exploitation of the computational performance of such systems is achieved through new resource-aware programming paradigms such as invasive computing. These paradigms introduce novel runtime adaptivity that renders compilation difficult, because the actual number of executing processors becomes known only at runtime. This is a challenging task, as running a just-in-time compiler on a multiprocessor system-on-chip (MPSoC) is often prohibitive due to the restricted memory available on such devices. Moreover, dynamic runtime compilation and optimized code generation might be out of reach for reasons of unacceptable time overheads. Therefore, novel compiler support for the adaptive parallel execution of programs on processor arrays such as tightly coupled processor arrays (TCPAs) is required. At the same time, it is of utmost importance to take countermeasures against the increasing proneness to errors of modern MPSoCs. In safety-critical environments such as avionics and automotive, single-event upsets (SEUs) might change the current program behavior either temporarily or even permanently. Accordingly, novel adaptive approaches that enable fault tolerance according to environmental conditions and/or application requirements are also needed for parallel program execution, where faults may otherwise propagate over multiple resources.
Alexandru-Petru Tanase, Frank Hannig, Jürgen Teich
Chapter 2. Fundamentals and Compiler Framework
Abstract
Heterogeneous systems including power-efficient hardware accelerators dominate the design of current and future embedded computer architectures, as a requirement for energy-efficient system design. In this context, we discuss the main principles of invasive computing and subsequently present the concept and structure of invasive tightly coupled processor arrays (TCPAs), which form the basis for our experiments throughout the book. For the efficient utilization of an invasive TCPA through the concrete invasive language InvadeX10, compiler support is paramount. Without such support, programming that leverages the abundant parallelism in such architectures is very difficult, tedious, and error-prone. Unfortunately, even today, there is a lack of compiler frameworks for generating efficient parallel code for massively parallel architectures. In this chapter, we therefore present LoopInvader, the first compiler for mapping nested loop programs onto invasive TCPAs. We furthermore discuss the fundamentals and background of the underlying models for algorithm and application specification.
Alexandru-Petru Tanase, Frank Hannig, Jürgen Teich
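To give a feel for the invasive computing model that this chapter builds on, the following is a minimal sketch of its invade/infect/retreat phases, transliterated into plain C. All types and functions here (claim_t, invade, infect, retreat) are hypothetical stubs chosen for illustration; they are not the InvadeX10 API or a real TCPA runtime.

/* Hypothetical sketch of the invade/infect/retreat resource-request
 * pattern of invasive computing, transliterated into plain C.
 * claim_t, invade, infect, and retreat are illustrative stubs. */
#include <stdio.h>

typedef struct { int rows, cols; } claim_t;   /* a claimed PE region */

/* Stub runtime calls: request a rows x cols region of PEs, run a
 * kernel on the claim, and release the resources again. */
static claim_t invade(int rows, int cols) { claim_t c = {rows, cols}; return c; }
static void infect(claim_t c, void (*kernel)(claim_t)) { kernel(c); }
static void retreat(claim_t c) { (void)c; }

static void loop_kernel(claim_t c) {
    /* The compiler-generated parallel loop code would run here; the
     * actual claim size (c.rows x c.cols) is known only at runtime. */
    printf("running on a %d x %d PE region\n", c.rows, c.cols);
}

int main(void) {
    claim_t c = invade(4, 4);   /* phase 1: claim a region of PEs   */
    infect(c, loop_kernel);     /* phase 2: load and execute code   */
    retreat(c);                 /* phase 3: free the claimed region */
    return 0;
}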
Chapter 3. Symbolic Parallelization
Abstract
This chapter presents solutions in the form of compile-time transformations in the polyhedron model for the adaptive parallel execution of loop programs on processor arrays. By analytical means, we show for the first time that it is possible to jointly tile and schedule a given loop nest with uniform data dependencies symbolically, using (1) outer loop parallelization for scenarios that are I/O-bound and (2) inner loop parallelization for scenarios that are memory-bound. These transformations are essential for self-organizing computing paradigms such as invasive computing, because the claimed region of processors, whose shape and size determine the forms of tiles during parallelization, is not known at compile time. In this realm, it is shown that it is indeed possible to derive parameterized latency-optimal schedules statically by a two-step approach: First, the iteration space of a loop program is tiled symbolically into orthotopes of parametrized extent. Subsequently, the resulting tiled program is also scheduled symbolically, resulting in a set of latency-optimal parameterized schedule candidates. At runtime, once the size of the processor array becomes known, simple comparisons of latency-determining expressions finally steer which of these schedules will be dynamically selected and the corresponding program configuration executed on the resulting processor array, so as to avoid any further runtime optimization or expensive recompilation. Symbolic schedules thus allow the number of processors to execute on to remain undetermined until runtime. This proposed compile/runtime hybrid approach significantly reduces the runtime overhead compared to dynamic or just-in-time compilation.
Alexandru-Petru Tanase, Frank Hannig, Jürgen Teich
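To make the compile/runtime split concrete, here is a hedged sketch in C for a one-dimensional loop: the tile size stays symbolic until the number of claimed processing elements is known, and a tiny runtime prologue compares the latency expressions of two precomputed schedule candidates. The latency formulas (latency_outer, latency_inner) are illustrative placeholders, not the book's actual closed-form latencies.

/* Sketch of the compile/runtime split: the iteration space 0..N-1 is
 * tiled with a symbolic tile size t, fixed only at runtime from the
 * number of claimed PEs, and a small prologue picks the candidate
 * schedule with the smaller latency. The latency expressions are
 * invented placeholders for illustration. */
#include <stdio.h>

/* Placeholder latency expressions for two schedule candidates,
 * parameterized in problem size N and tile size t. */
static long latency_outer(long N, long t) { return (N / t) + 3 * t; }
static long latency_inner(long N, long t) { return 2 * (N / t) + t; }

static void run_tiled(long N, long t) {
    /* Symbolically tiled loop: the tile loop is distributed across
     * the PEs, the point loop runs locally within one PE. */
    for (long tt = 0; tt < N; tt += t)              /* tile (processor) loop */
        for (long i = tt; i < tt + t && i < N; i++) /* point loop */
            ;                                       /* loop body S(i) */
}

int main(void) {
    long N = 1024;  /* problem size, fixed at compile time      */
    long p = 8;     /* #claimed PEs, known only at runtime      */
    long t = N / p; /* tile size derived from the runtime claim */

    /* Runtime prologue: compare the candidates' latency expressions
     * and pick the latency-optimal schedule; no recompilation needed.
     * In the real approach this choice selects a precomputed program
     * configuration; here we only report it. */
    const char *choice = (latency_outer(N, t) <= latency_inner(N, t))
                         ? "outer-parallel" : "inner-parallel";
    run_tiled(N, t);
    printf("selected schedule: %s (tile size %ld)\n", choice, t);
    return 0;
}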
Chapter 4. Symbolic Multi-Level Parallelization
Abstract
Today’s multiprocessor systems-on-chip (MPSoCs) have brought up massively parallel processor array accelerators that may achieve high computational efficiency by exploiting multiple levels of parallelism and different memory hierarchies. However, existing parallelization techniques are often unable to exploit multiple levels of parallelism and are either I/O- or memory-bound. Furthermore, if the number of available processing elements becomes known only at runtime, as in adaptive systems, static approaches fail. In this chapter, we solve these problems by proposing a hybrid compile/runtime multi-level symbolic parallelization technique that is able to (a) exploit multiple levels of parallelism, (b) exploit different memory hierarchies, and (c) match the I/O or memory capabilities of the target architecture for scenarios where the number of available processing elements is known only at runtime. Our proposed technique consists of two compile-time transformations: (1) symbolic hierarchical tiling followed by (2) symbolic multi-level scheduling. The tiling levels scheduled in parallel exploit different levels of parallelism, whereas the sequentially scheduled levels exploit different memory hierarchies. Furthermore, by tuning the size of the tiles on the individual levels, a tradeoff between the necessary I/O bandwidth and memory is possible, which facilitates obeying resource constraints. The resulting schedules are symbolic with respect to the problem size and tile sizes. Thus, the number of processing elements to map onto does not need to be known at compile time. At runtime, when the number of available processors becomes known, a simple prologue chooses a feasible schedule with respect to I/O and memory constraints that is latency-optimal for the chosen tile size. This approach exploits multiple levels of parallelism, is independent of the problem size of the loop nest, and thereby avoids any expensive recompilation at runtime. This is particularly important for low-cost and memory-scarce embedded MPSoC platforms that may not afford to host a just-in-time compiler.
Alexandru-Petru Tanase, Frank Hannig, Jürgen Teich
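The following is a minimal sketch, in C, of the two-level idea for a one-dimensional loop: the outer (parallel) tiling level distributes work over the claimed PEs, while the inner (sequential) level chunks each PE's work to fit its local memory, so the inner tile size becomes the knob trading memory against I/O bandwidth. The concrete sizes and the flat memory model are assumptions for illustration, not the book's formulas.

/* Sketch of two-level hierarchical tiling for a 1-D loop of size N:
 * the level-1 tile count p equals the number of PEs working in
 * parallel, while the level-2 tile size m bounds how much data a PE
 * buffers locally; tuning m trades local memory against I/O traffic. */
#include <stdio.h>

int main(void) {
    long N = 1 << 20;  /* problem size                                */
    long p = 16;       /* level-1 tiles = #PEs, known only at runtime */
    long m = 256;      /* level-2 (sequential) tile size per PE       */

    long per_pe = N / p;  /* iterations assigned to each PE */

    /* Level 1 (parallel): one iteration of this loop per PE.      */
    for (long pe = 0; pe < p; pe++)
        /* Level 2 (sequential): chunks of m iterations per PE,    */
        /* each chunk sized to fit the PE's local memory.          */
        for (long c = 0; c < per_pe; c += m)
            for (long i = 0; i < m && c + i < per_pe; i++)
                ; /* loop body S(pe * per_pe + c + i) */

    /* Larger m: fewer refills (less I/O) but more local memory;   */
    /* smaller m: more I/O traffic but less memory per PE.         */
    printf("PEs=%ld, chunks per PE=%ld, buffer=%ld elements\n",
           p, per_pe / m, m);
    return 0;
}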
Chapter 5. On-Demand Fault-Tolerant Loop Processing
Abstract
As the feature sizes of silicon devices continue to shrink, it is imperative to counter the increasing proneness to errors of modern, complex systems by applying appropriate fault tolerance measures. In this chapter, we therefore propose new techniques that leverage the advantages of self-organizing computing paradigms such as invasive computing to implement fault tolerance on multiprocessor systems-on-chip (MPSoCs) adaptively. We present new compile-time transformations that introduce modular redundancy into a loop program to protect it against soft errors. Our approach uses the abundant number of processing elements (PEs) within a tightly coupled processor array (TCPA) to claim not just one region of a processor array, but instead two (dual modular redundancy (DMR)) or three (triple modular redundancy (TMR)) contiguous neighboring regions of PEs. At the source code level, the compiler realizes these replication schemes with a program transformation that (1) replicates a parallel loop program two or three times for DMR or TMR, respectively, and (2) introduces appropriate voting operations whose frequency and location may be chosen from three proposed variants. Which variant to choose depends, for example, on the error resilience needs of the application or the expected soft error rates. Finally, we explore the different tradeoffs of these variants regarding performance overheads and error detection latency.
Alexandru-Petru Tanase, Frank Hannig, Jürgen Teich
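As a small illustration of the TMR scheme described above, the C sketch below runs three replicas of a loop body and applies a majority voter after every iteration, masking a single injected soft error. The replicas execute sequentially on one host here; on a TCPA they would occupy three neighboring PE regions, and voting after every iteration corresponds to the variant with the lowest error detection latency (and the highest overhead).

/* Minimal sketch of TMR voting at the loop level: three replicas of a
 * loop body produce results, and a majority voter detects and masks a
 * single faulty replica. */
#include <stdio.h>

/* Majority vote over three replica outputs: any two agreeing values
 * win, masking one faulty replica. */
static int vote(int a, int b, int c) {
    if (a == b || a == c) return a;
    if (b == c) return b;
    /* All three disagree: uncorrectable error; signal it. */
    fprintf(stderr, "TMR: no majority\n");
    return a;
}

static int body(int i) { return i * i; }  /* the protected loop body */

int main(void) {
    for (int i = 0; i < 4; i++) {
        int r0 = body(i);          /* replica 0                      */
        int r1 = body(i);          /* replica 1                      */
        int r2 = (i == 2) ? -1     /* replica 2: injected soft error */
                          : body(i);
        printf("i=%d -> %d\n", i, vote(r0, r1, r2));
    }
    return 0;
}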
Chapter 6. Conclusions and Outlook
Abstract
This chapter concludes the book by summarizing its key contributions: fundamental loop transformations for the parallel execution of invasive loop programs on massively parallel processor arrays, such as tightly coupled processor arrays (TCPAs), without the need for expensive dynamic runtime compilation. Furthermore, possible directions for future work in this important research area are outlined.
Alexandru-Petru Tanase, Frank Hannig, Jürgen Teich
Backmatter
Metadata
Title
Symbolic Parallelization of Nested Loop Programs
Authors
Alexandru-Petru Tanase
Frank Hannig
Jürgen Teich
Copyright Year
2018
Electronic ISBN
978-3-319-73909-0
Print ISBN
978-3-319-73908-3
DOI
https://doi.org/10.1007/978-3-319-73909-0