TBB offers high-level parallel algorithms, such as parallel_for or pipeline (see Chapter 2), that are ready to use when our algorithms fit into these particular patterns. But what if our algorithm is not that simple? Or what if the available high-level abstraction is not squeezing out the last drop of performance of our parallel hardware? Should we just give up and remain prisoners of the high-level features of the programming model? Of course not! There should be a capability to get closer to the hardware, a way to build our own templates from the ground up, and a way to thoroughly optimize our implementation using low-level and more tunable characteristics of the programming model. And in TBB, this capability exists. In this chapter, we will focus on one of the most powerful low-level features of TBB, the task programming interface. As we have said throughout the book, tasks are at the heart of TBB, and tasks are the building blocks used to construct the high-level templates such as parallel_for and pipeline. But there is nothing that prevents us from venturing into these deeper waters and starting to code our algorithms directly with tasks, from building our own high-level templates for future use on top of tasks, or, as we describe in the next chapters, from fully optimizing our implementation by fine-tuning the way in which tasks are executed. In essence, this is what you will learn by reading this chapter and the ones that follow. Enjoy the deep dive!

A Running Example: The Fibonacci Sequence
The Fibonacci numbers form the sequence

0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, ...

The nth number in the sequence, F_n, can be computed recursively as F_n = F_{n-1} + F_{n-2}, starting from the base cases F_0 = 0 and F_1 = 1. There are several algorithms that compute F_n, but in the interest of illustrating how TBB tasks work, we chose the one presented in Figure 10-1, although it is not the most efficient one. The code of Figure 10-1 implements the recurrence F_n = F_{n-1} + F_{n-2} directly. While it may be easy to understand, we clarify it further in Figure 10-2, where we depict the recursive calling tree when calling fib(4).
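Since Figure 10-1 is not reproduced here, the following is a minimal sketch of the serial recursion it implements (the figure's actual code may differ in details):

    long fib(long n) {
      if (n < 2) return n;              // base case: F_0 = 0, F_1 = 1
      return fib(n - 1) + fib(n - 2);   // F_n = F_{n-1} + F_{n-2}
    }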
The if (n<2) line at the beginning of the serial code of Figure 10-1 caters for what is called the base case, which is always needed in recursive codes to avoid infinite recursion, which is nice because we don’t want to nuke the stack, do we?

The High-Level Approach: parallel_invoke
The highest-level abstraction we can use here is parallel_invoke (see Chapter 2). Relying on it, we can come up with a first parallel implementation of the Fibonacci algorithm, which we present in Figure 10-3. The parallel_invoke function recursively spawns parallel_fib(n-1) and parallel_fib(n-2), returning the results in stack variables x and y that are captured by reference in the two lambdas. When these two tasks finish, the caller task simply returns the sum x+y. The recursive nature of the implementation keeps invoking parallel tasks until the base case is reached when n<2. This means that TBB will create a task even for computing parallel_fib(1) and parallel_fib(0), which just return 1 and 0, respectively. As we have said throughout the book, we want to expose enough parallelism to the architecture by creating a sufficiently large number of tasks, but at the same time tasks must also have a minimum degree of granularity (>1 microsecond, as we discuss in Chapters 16 and 17) so that task creation overhead pays off. This trade-off is usually implemented in this kind of algorithm using a “cutoff” parameter, as we show in Figure 10-4.
The idea is to stop creating additional tasks when n is not large enough (n<cutoff), and in this case we resort to the serial execution. Computing a suitable cutoff value requires some experimentation, so it is advisable to write our code so that cutoff can be an input parameter to ease the search for a suitable one. For example, in our test bed, fib(30) only takes around 1 millisecond, so this is a fine-grained-enough task to discourage further splitting. Thus, it makes sense to set cutoff=30, which results in calling the serial version of the code for tasks that receive n=29 and n=28, as we can see in Figure 10-5.
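Combining the ideas of Figures 10-3 and 10-4, a sketch of the cutoff-based implementation could look as follows (cutoff is assumed to be defined elsewhere; the figures' actual code may differ):

    #include <tbb/parallel_invoke.h>

    long parallel_fib(long n) {
      if (n < cutoff) return fib(n);        // serial version below the cutoff
      long x, y;
      tbb::parallel_invoke(
        [&] { x = parallel_fib(n - 1); },   // results captured by reference
        [&] { y = parallel_fib(n - 2); });
      return x + y;                         // implicit barrier before this line
    }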
If you noticed that we end up computing fib(29) in three different tasks and fib(28) in two additional ones, you are right, it is silly! As a disclaimer, we already said that this is not the optimal implementation but a commonly used recursive example that serves our educational interests. A clear optimization would be to implement the recursion in a manner such that already computed Fibonacci numbers are not recomputed again (memoization brings the complexity down to O(n), and fast-doubling or matrix-exponentiation methods achieve O(log n)), but this is not our goal today. And that is it so far: a parallel implementation based on parallel_invoke, which was already covered way back in Chapter 2. Have we really reached the second, more advanced, section of this book? Yes! Well… where are the advanced features, the low-level knobs that we may need and the optimization opportunities that we love??? Okay, let’s start diving into deeper waters!!!

The Highest Among the Lower: task_group
If we want a bit more control without dropping to the lowest level, the task_group class can serve us well. This is a higher-level and easier-to-use class, a medium-level abstraction, if you will. A possible re-implementation of the Fibonacci code that relies on task_group is presented in Figure 10-6.
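A sketch of what such a re-implementation might look like (again assuming a global cutoff; Figure 10-6 may differ in details):

    #include <tbb/task_group.h>

    long parallel_fib(long n) {
      if (n < cutoff) return fib(n);
      long x, y;
      tbb::task_group g;                        // handle to a group of tasks
      g.run([&] { x = parallel_fib(n - 1); });  // spawn first task
      g.run([&] { y = parallel_fib(n - 2); });  // spawn second task
      g.wait();                                 // block until both finish
      return x + y;
    }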
The code is quite similar to the one based on parallel_invoke. However, we would like to underscore that, differently from the parallel_invoke alternative, we now have a handle to a group of tasks, g, and as we will discuss later, this enables some additional possibilities, such as task cancellation. Also, by explicitly calling the member functions g.run() and g.wait(), we spawn the new tasks and wait for them to finish their computation at two different program points, whereas the parallel_invoke function has an implicit task barrier after the task spawning. To begin with, this separation between run() and wait() allows the caller thread to do some computation between spawning some tasks and waiting for them in the blocking call wait(). In addition, this class also offers other interesting member functions that can come in handy in some situations:
- void run_and_wait( const Func& f ), which is equivalent to {run(f); wait();}, but guarantees that f runs on the current thread. We will see later (in section “The Low-Level Task Interface: Part Two – Task Continuation”) that there is a convenient trick to bypass the TBB scheduler. If we first call run(f), we basically spawn a task that gets enqueued in the worker thread’s local queue. When calling wait(), we call the scheduler, which dequeues the just-enqueued task if nobody else has stolen it in the meantime. The purpose of run_and_wait is twofold: first, we avoid the overhead of the enqueue-schedule-dequeue steps, and second, we avoid the potential stealing that can happen while the task is in the queue.
- void cancel(), which cancels all tasks in this task_group. Maybe the computation was triggered from a user interface, UI, that also includes a “cancel” button. If the user presses this button, there is a way to stop the computation. In Chapter 15, we further elaborate on cancellation and exception handling.
- task_group_status wait(), which returns the final status of the task group. Return values can be: complete (all tasks in the group have finished); canceled (the task_group received a cancellation request); not_complete (not all tasks in the group have completed).
Note that each invocation of parallel_fib in Figure 10-6 creates a new task_group, so it is possible to cancel one branch without affecting the others, as we will see in Chapter 15. Using a single task_group would also pose an additional downside: creating too many tasks in that single group results in task creation serialization and the ensuing loss of scalability. Consider, for example, that we are tempted to write code like this:
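(The original snippet is lost in this copy; presumably it spawned all the work from a single group, along the lines of this sketch, with foo() standing in for the per-task work:)

    tbb::task_group g;
    for (int i = 0; i < n; ++i)
      g.run([]{ foo(); });   // n tasks, all spawned by the same thread
    g.wait();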
These n tasks will be spawned one after the other and by the same thread. The other worker threads will be forced to steal every task created by the one executing g.run(). This will certainly kill the performance, especially if foo() is a fine-grained task and the number of worker threads, nth, is high. The recommended alternative is the one used in Figure 10-6, where a recursive deployment of tasks is exercised. In that approach, the worker threads steal at the beginning of the computation, and ideally, in log2(nth) steps, all nth worker threads are working on their own tasks, which in turn enqueue more tasks in their local queues. For example, for nth=4, the first thread, A, spawns two tasks and starts working on one while thread B steals the other. Now, threads A and B spawn two tasks each (four in total) and start working on two of them, but the other two are stolen by threads C and D. From now on, all four threads are working and enqueueing more tasks in their local queues, stealing again only when they run out of local tasks. Later in the chapter, we will also see how parallel_pipeline and the Flow Graph can better exploit locality thanks to a scheduler-bypassing technique; we will explain how this technique works and why it is beneficial.

The Low-Level Task Interface: Part One – Task Blocking
Our first task-based version of parallel_fib is presented in Figure 10-7. The root task is allocated with the overloaded new operator and the task::allocate_root member function. The _root suffix in the name denotes the fact that the created task has no parent; it is the root of a task tree. The task being allocated is FibTask{n,&sum} (the task definition is presented in the next figure), constructed by the overloaded new. When the task is run in step 3, it computes the nth Fibonacci number and stores it into sum. The task is executed, and waited upon, with task::spawn_root_and_wait.
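A sketch of this driver function, using the classic tbb::task API described in this chapter (an API that was removed in oneTBB, so treat this as illustrative):

    #include <tbb/task.h>

    long parallel_fib(long n) {
      long sum;
      FibTask& a = *new(tbb::task::allocate_root()) FibTask{n, &sum};
      tbb::task::spawn_root_and_wait(a);   // run the root task and wait
      return sum;
    }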
The real work is done inside the class FibTask defined in Figure 10-8. This is a relatively larger piece of code, compared to fib and the two previous parallel implementations of parallel_fib. We were warned: this is a lower-level implementation, and as such it is not as productive or friendly as a high-level abstraction. To make up for the extra burden, we will see later how this class allows us to get our hands dirty, tweaking under the hood and tuning the behavior and performance at our will.
FibTask is derived from the class tbb::task. The fields n and sum hold the input value and the pointer to the output, respectively. These are initialized with the arguments passed to the constructor FibTask(long n_, long ∗sum_). The execute member function does the actual computation. Every task must provide a definition of execute that overrides the pure virtual member function tbb::task::execute. The definition should do the work of the task and return either nullptr or a pointer to the next task to run, as we saw in Figure 9-14. In this simple example, it returns nullptr.
In a nutshell, FibTask::execute() does the following:

- Checks whether n<cutoff and resorts to the sequential version in this case.
- Otherwise, allocates two child tasks, a and b, that compute F_{n-1} and F_{n-2}, respectively. Here, the inherited member function allocate_child() is used to allocate space for each task. Remember that the top-level routine parallel_fib used allocate_root() to allocate space for a task. The difference is that here the task is creating child tasks. This relationship is indicated by the choice of allocation method. The different allocation methods are listed in Appendix B, Figure B-76.
- Calls set_ref_count(3). The number 3 represents the two children and an additional implicit reference that is required by the member function spawn_and_wait_for_all. This set_ref_count member function initializes the ref_count attribute of each TBB task. Each time a child ends its computation, it decrements the ref_count attribute of its parent. Make sure to call set_ref_count(k+1) before spawning the k children if the task uses wait_for_all to be resumed after the children complete. Failure to do so results in undefined behavior. The debug version of the library usually detects and reports this type of error.
- Spawns b with the tbb::task::spawn(b) member function, which returns immediately without waiting for the child task to start executing. The second spawning, by the member function tbb::task::spawn_and_wait_for_all(a), is equivalent to tbb::task::spawn(a); tbb::task::wait_for_all(). The latter member function causes the parent to wait until all currently allocated child tasks are finished. For this reason, we say this implementation follows what we call a task blocking style.
Once both children finish, the ref_count attribute of the parent task has been decremented twice and its value is now one. This causes the parent task to resume just after the spawn_and_wait_for_all(a) call, so it computes x+y and stores it in ∗sum.
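Putting the pieces together, here is a sketch of the FibTask class along the lines of Figure 10-8 (classic tbb::task API; details may differ from the figure):

    #include <tbb/task.h>

    class FibTask : public tbb::task {
    public:
      long n;
      long* sum;
      FibTask(long n_, long* sum_) : n{n_}, sum{sum_} {}
      tbb::task* execute() override {      // overrides tbb::task::execute
        if (n < cutoff) {
          *sum = fib(n);                   // sequential base case
        } else {
          long x, y;                       // results live on this stack frame
          FibTask& a = *new(allocate_child()) FibTask{n - 1, &x};
          FibTask& b = *new(allocate_child()) FibTask{n - 2, &y};
          set_ref_count(3);                // two children + one for the wait
          spawn(b);
          spawn_and_wait_for_all(a);       // task blocking style
          *sum = x + y;
        }
        return nullptr;
      }
    };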
To better understand how all this works, let’s trace the execution of a root_task FibTask(8, &sum), having set cutoff=7. Assuming a single thread executing all the tasks, and with some simplification in the way the stack is used, Figure 10-9 gives a streamlined representation of the computations carried out. When parallel_fib(8) is invoked, the variable sum is stored on the stack, and the root task is allocated on the heap and constructed with FibTask(8, &sum). This root task is executed by a worker thread that runs the overridden execute() member function. Inside this member function, two stack variables x and y are declared, and two new child tasks, a and b, are allocated and enqueued in the worker thread’s local deque. In the constructors of these two tasks, we pass FibTask(7, &x) and FibTask(6, &y), which means that the member variable sum of the newly created tasks will point to FibTask(8)’s stack variables x and y, respectively.
execute() continues by setting the ref_count of the task to 3, spawning first b and then a, and waiting for both. At this point, the root task is suspended until it has no pending children. Remember that this is the task blocking style. The worker thread returns to the scheduler, where it will first dequeue task a (because it was enqueued last). This task a (FibTask(7,&x)) will recursively repeat the same process, suspending itself after allocating a new x and y on its stack and spawning FibTask(6,&x) and FibTask(5,&y). Since cutoff=7, these two new tasks resort to the base case and call fib(6) and fib(5), respectively. FibTask(6,&x) is dequeued first, writes 8 to ∗sum (where sum points to x in FibTask(7)’s stack), and returns nullptr. Then, FibTask(6,&x) is destroyed, but in the process, the ref_count variable of its parent task (FibTask(7,&x)) is first decremented. The worker thread then dequeues FibTask(5,&y), which writes 5 to ∗sum (now an alias of y on the stack) and also returns nullptr. This results in ref_count reaching the value 1, which wakes up the parent task FibTask(7,&x); it just has to add 5+8, write the result to ∗sum (an alias of x in FibTask(8)’s stack), and return nullptr. This decrements the ref_count of the root task to 2. Next, the worker thread dequeues FibTask(6,&y), which calls fib(6), writes y=8 on the stack, returns, and dies. This finally leaves the root task without children (ref_count=1), so it can continue execution just after the spawn_and_wait_for_all() member function, add 8+13, write the result to ∗sum (an alias of sum in the stack of parallel_fib), and get destroyed. If you are exhausted after reading the description of all this process, so are we, but there is even a bit more, so hold on for one more second. Now, imagine that there is more than one worker thread. Each one will have its own stack and will fight to steal tasks. The result, 21, will be the same and, in essence, the same tasks will be executed, though now we don’t know which thread will take care of each task. What we do know is that if the problem size and the number of tasks are large enough, and if the cutoff is wisely set, then this parallel code will run faster than the sequential one.
Internally, each task keeps a reference to its successor (its parent), which can be nullptr. The method tbb::task::parent() gives you read-only access to this successor pointer. Each task also has a ref_count that accounts for the number of tasks that have it as a successor (i.e., the number of children that the parent has to wait for before it can be triggered for execution).

With this low-level interface, we have achieved the same behavior as with the parallel_invoke and task_group classes, but at a higher programming cost. So, where is the bang for the buck? The task class has more member functions that will be covered soon, and the implementation discussed in this section is just the foundation on which more optimized versions will be built. Stick with us and keep reading.

The Low-Level Task Interface: Part Two – Task Continuation
In the blocking style just presented, a worker thread gets held up at the wait_for_all() call in a parent task. There is no point in wasting this worker thread waiting for the children tasks to finish, so we entrust it with the execution of other tasks. This means that when the parent task is ready to run again, the original worker thread that was taking care of it may be distracted with other duties and cannot respond immediately. The continuation-passing style avoids this problem; it is used inside TBB algorithms and in the Flow Graph’s continue_node (and other nodes with potential for scheduler bypass). The power of continuations (and task recycling, which we cover next) is worth knowing as a parallel programmer – you’ll never want to let a task sit around waiting again (and wasting precious resources)! Let’s see now how to re-engineer FibTask to implement the continuation-passing style.
The new continuation task, FibCont, also has an execute() member function, but now it includes only the code that has to be done once the children tasks are complete. For our Fibonacci example, after the children complete, we only need to add the results they bring and return; these are the only two lines of code after the spawn_and_wait_for_all(a) in the code of Figure 10-8. The continuation task declares three member variables: a pointer to the final sum and the partial sums from the two children, x and y. The constructor FibCont(long∗ sum) initializes the pointer adequately.
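A sketch of such a continuation task (details may differ from the book’s figure):

    class FibCont : public tbb::task {
    public:
      long x, y;               // filled in by the two children
      long* const sum;
      FibCont(long* sum_) : sum{sum_} {}
      tbb::task* execute() override {
        *sum = x + y;          // the only work left after the children finish
        return nullptr;
      }
    };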
Now we have to modify our FibTask class to properly create and initialize the continuation task FibCont. In the new version of FibTask, shown in Figure 10-11, the x and y private variables are not declared anymore and have been commented out. However, there is now a new task c of type FibCont&. This task is allocated with the allocate_continuation() function, which is similar to allocate_child(), except that it transfers the parent reference of the calling task (this) to c and sets the parent attribute of this to nullptr. The reference count, ref_count, of this’s parent does not change, since the parent still has the same number of children, although one of the children has suddenly been mutated from the FibTask type to the FibCont one. If you are a happy parent, don’t try this at home!

FibTask is still alive, but we will do away with it soon. FibTask does not have a parent anymore, but it is in charge of some chores before dying. FibTask first creates the two FibTask children, but watch out!
- The new tasks a and b are now children of c (not of this) because we allocate them using c.allocate_child() instead of just allocate_child(). In other words, c is now the successor of both a and b.
- The results of the children are not written into stack-stored variables anymore. When initializing a, the constructor invoked is now FibTask(n-1, &c.x), so the pointer sum in the child task a (a.sum) is actually pointing to c.x. Likewise, b.sum points to c.y.
- The reference count of c (sort of an internal and private c.ref_count) is only set to two (c.set_ref_count(2)) since FibCont c actually has only two children (a and b).
Finally, a and b are ready to be spawned, and that’s all the duties of FibTask. Now it can die in peace, and the memory it was occupying can be safely reclaimed. R.I.P.
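A sketch of the continuation-passing FibTask::execute() along the lines of Figure 10-11 (illustrative, not the figure’s verbatim code):

    tbb::task* execute() override {
      if (n < cutoff) { *sum = fib(n); return nullptr; }
      // long x, y;  // not needed anymore: results go into the continuation
      FibCont& c = *new(allocate_continuation()) FibCont{sum};
      FibTask& a = *new(c.allocate_child()) FibTask{n - 1, &c.x};
      FibTask& b = *new(c.allocate_child()) FibTask{n - 2, &c.y};
      c.set_ref_count(2);   // exactly two children; no extra reference needed
      spawn(b);
      spawn(a);
      return nullptr;       // this task dies; c runs when a and b are done
    }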
Note the difference in how the reference counts are set. If a task A spawns k children and waits for them using the wait_for_all member function, A.ref_count has to be set to k+1. The extra “1” accounts for the additional work that task A has to finish before ending and dispatching A’s parent. This extra “1” is not needed when following the continuation-passing style because A transfers the additional work to the continuation task C. In this case, C.ref_count has to be set to exactly k if it has k children.
In Figure 10-12, the root task FibTask(8,&sum) has already created the continuation FibCont(sum) and the tasks FibTask(7,&c.x) and FibTask(6,&c.y), which are actually children of FibCont. On the stack, we only store the final result sum, which is local to the parallel_fib function, because x and y do not use stack space with this style. Now, x and y are member variables of FibCont and are stored on the heap. In the bottom part of this figure, we see that the original root task has disappeared along with all the memory it was using. In essence, we are trading stack space for heap space, and FibTask objects for FibCont ones, which is beneficial if FibCont objects are smaller. We also see that the parent reference from FibTask(7,&c.x) to the root FibCont(&sum) has been transferred to the younger FibCont.
In Figure 10-13, there are no FibTask objects anymore. The child tasks FibTask(6,&c.x) and FibTask(5,&c.y) have resorted to the base case (n<cutoff, assuming cutoff=7) and are about to return after having written ∗sum with 8 and 5, respectively. Each one of the children returns nullptr, so the worker thread takes control again, goes back to the work-stealing scheduler, decrements the ref_count of the parent task, and checks whether or not ref_count is 0. In such a case, following the high-level description of the TBB task dispatch loop presented in Figure 9-14 of Chapter 9, the next task to take is the parent one (FibCont in this case). Contrary to the blocking style, this now happens right away. In the bottom part of Figure 10-13, we see the two children of the original root task that have already written their results.
But wait: the parallel_fib function is still waiting in spawn_root_and_wait(a) on the first root task that was created, even though this original FibTask was replaced by the first FibCont object and then died (see Figure 10-12). Well, indeed, parallel_fib is still waiting, because spawn_root_and_wait is designed to work correctly with the continuation-passing style. An invocation of spawn_root_and_wait(x) does not actually wait for x to complete. Instead, it constructs a dummy successor of x and waits for the successor’s ref_count to become 0. Because allocate_continuation forwards the parent reference to the continuation, the dummy successor’s ref_count is not decremented until the continuation FibCont completes. This is illustrated in Figure 10-14.

Bypassing the Scheduler
Scheduler bypass is an optimization in which we directly specify the next task to run instead of letting the scheduler choose. When FibTask::execute() returns, by the task-getting rules of the work-stealing scheduler described in Chapter 9, task a is always the next task taken from the ready pool because it was the last one spawned (unless it has been stolen by another worker thread). More precisely, the sequence of events is as follows:
- Push task a onto the thread’s deque.
- Return from the member function execute().
- Pop task a from the thread’s deque, unless it is stolen by another thread.
This push/pop round trip through the deque incurs some overhead that can be bypassed: execute does not spawn the task but instead returns a pointer to it as the result. This approach guarantees that the same worker thread immediately executes a, not some other thread. To that end, in the code of Figure 10-11, we need to replace these two lines as follows:

    spawn(a);          ➜   //spawn(a);  commented out!
    return nullptr;         return &a;
The Low-Level Task Interface: Part Three – Task Recycling
Task recycling goes one step further: instead of allocating a new child task a, we reuse the task that is currently executing, so a is now the recycled this. The call recycle_as_child_of(c) has several effects:
- It marks this not to be automatically destroyed when execute returns.
- It sets the successor of this to be c. To prevent reference-counting problems, recycle_as_child_of has a prerequisite that this must have a nullptr successor (this’s parent reference should point to nullptr). This is the case after allocate_continuation occurs.
The recycled task must then be updated so that it behaves as the newly allocated FibTask(n-1,&c.x) did in the previous version. To that end, this->n is decremented (n -= 1;), and this->sum is initialized to point to c.x.
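A sketch of the recycling version of execute(), combining these ideas with the scheduler bypass of the previous section (illustrative; it also assumes the n and sum members are non-const so they can be updated):

    tbb::task* execute() override {
      if (n < cutoff) { *sum = fib(n); return nullptr; }
      FibCont& c = *new(allocate_continuation()) FibCont{sum};
      FibTask& b = *new(c.allocate_child()) FibTask{n - 2, &c.y};
      recycle_as_child_of(c);  // reuse this task as the other child of c
      n -= 1;                  // this now plays the role of FibTask(n-1, &c.x)
      sum = &c.x;
      c.set_ref_count(2);
      spawn(b);
      return this;             // scheduler bypass: run the recycled task next
    }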
We do have to guarantee that this’s member variables are not used in the current execution of the task after the recycled task is spawned. This is the case in our example, since the recycled task is actually not spawned and will only run after the pointer this is returned. You can spawn the recycled task instead (i.e., spawn(∗this); return nullptr;), as long as none of its member variables is used after the spawning. This restriction applies even to const member variables, because after the task is spawned, it might run and be destroyed before the parent progresses any further. A similar member function, task::recycle_as_continuation(), recycles a task as a continuation instead of as a child. Figure 10-16 shows the recycled root task FibTask(8,&sum) as a child of FibCont, once the task has updated its member variables (8 becomes 7, and sum points to c.x).
This optimization is also exploited inside TBB algorithms and in the Flow Graph’s continue_node (and other nodes with potential for scheduler bypass). In the next section, we present an example in which we leverage the low-level task API and evaluate its impact, but before that, check out our “checklist.”

Task Interface Checklist
Most of the time, our needs are well served by what parallel_for, parallel_reduce, and so on do. If this is not the case and you need the extra performance offered by the task API, some of the details to remember are:
- Always use new(allocation_method) T to allocate a task, where allocation_method is one of the allocation methods of class task (see Appendix B, Figure B-76). Do not create local or file-scope instances of a task.
- All siblings should be allocated before any of them starts running, unless you are using allocate_additional_child_of. We will elaborate on this in the last section of the chapter.
- Exploit continuation passing, scheduler bypass, and task recycling to squeeze out maximum performance.
- If a task completes and was not marked for re-execution (recycling), it is automatically destroyed. Also, its successor’s reference count is decremented, and if it hits zero, the successor is automatically spawned.
One More Thing: FIFO (aka Fire-and-Forget) Tasks
A task can also be enqueued instead of spawned. Such a FifoTask class derives from tbb::task and overrides the execute() member function as every normal task does. The four differences from spawned tasks are listed next (a minimal enqueue sketch follows the list):
- A spawned task can be postponed by the scheduler until it is waited upon, but an enqueued task will eventually be executed even if no thread explicitly waits on it. Even if the total number of worker threads is zero, a special additional worker thread is created to execute enqueued tasks.
- Spawned tasks are scheduled in a LIFO-like order (the most recently spawned is started next), but enqueued tasks are processed in roughly (not precisely) FIFO order (started in approximately the order they entered the queue – the “approximation” gives TBB some flexibility to be more efficient than a strict policy would allow).
- Spawned tasks are ideal for recursive parallelism, where they save memory space thanks to a depth-first traversal, but enqueued tasks can prohibitively consume memory for recursive parallelism since the recursion will expand in a breadth-first traversal.
- Spawned parent tasks should wait for their spawned children to complete, but enqueued tasks should not be waited upon because other enqueued tasks from unrelated parts of the program might have to be processed first. The recommended pattern for using an enqueued task is to have it asynchronously signal its completion. Essentially, enqueued tasks should be allocated as roots, instead of as children that are then waited upon.
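Here is the promised minimal sketch of a fire-and-forget task (do_async_work() is a hypothetical placeholder; classic tbb::task API):

    void do_async_work();    // hypothetical work function, defined elsewhere

    class FifoTask : public tbb::task {
      tbb::task* execute() override {
        do_async_work();     // fire-and-forget; signal completion yourself
        return nullptr;
      }
    };

    // Enqueued as a root task; nobody waits on it.
    tbb::task::enqueue(*new(tbb::task::allocate_root()) FifoTask);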
Putting These Low-Level Features to Work
To exercise these low-level features, we use a wavefront computation that sweeps an n×n 2D grid. Each cell has a data dependence on two elements of the adjacent cells. For example, in Figure 10-17(a), we see that cell (2,3) depends on the north (1,3) and west (2,2) cells, since on each iteration of the i and j loops, cells that were calculated in previous iterations are needed: A[i,j] depends on A[i-1,j] (north dependency) and A[i,j-1] (west dependency). In Figure 10-18, we show the sequential version of the computation, where the array A has been linearized. Clearly, the antidiagonal cells are totally independent, so they can be computed in parallel. To exploit this parallelism (loops “i” and “j”), a task will carry out the computations corresponding to each cell of the iteration space (or task space, from now on), and independent tasks will be executed in parallel.
Each task performs some computation, foo, at its (i,j) cell of the matrix. Without loss of generality, we assume that the computational load of each cell is controlled by the gs (grain size) parameter of the foo function. That way, we can define the granularity of the tasks and, therefore, study the performance of the different implementations depending on the task granularity, as well as situations with homogeneous or heterogeneous task workloads.
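The sequential sweep of Figure 10-18 boils down to something like this sketch (the exact signature of foo and the boundary handling are our assumptions):

    // Sequential wavefront over a linearized n x n grid A.
    for (int i = 1; i < n; ++i)
      for (int j = 1; j < n; ++j)
        A[i*n + j] = foo(gs, A[(i-1)*n + j], A[i*n + j - 1]);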
After executing the task corresponding to cell (1,1), which does not depend on any other task, two new tasks can be dispatched (the one below, (2,1), and the one to the right, (1,2)). This dependence information can be captured by a 2D matrix of counters, like the one we show in Figure 10-17(b). The value of each counter indicates how many tasks that task still has to wait for. Only the tasks whose corresponding counter has been nullified can be dispatched.
A first implementation of this idea is provided in the companion source code as wavefront_v0_DAG.cpp. However, that version requires that all the tasks be preallocated beforehand, and the implementation that we present next is more flexible and can be tuned to better exploit locality, as we will see later. In Figure 10-19, we show our first task-based implementation, which we call wavefront_v1_addchild. Each ready task first executes the task body and then decrements the counters of the tasks that depend on it. If this decrement operation leaves a counter equal to 0, the task is also responsible for spawning the newly independent task. Note that the counters are shared and will be modified by different tasks running in parallel. To account for this issue, the counters are atomic variables (see Chapter 5).
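A sketch of what the per-cell task of wavefront_v1_addchild might look like (the names, indexing, and foo signature are our assumptions; see the companion code for the real Figure 10-19 implementation):

    #include <atomic>
    #include <vector>
    #include <tbb/task.h>

    extern int n, gs;
    extern std::vector<double> A;                    // linearized n x n grid
    extern std::vector<std::atomic<int>> counters;   // pending-dependence counts
    double foo(int gs, double north, double west);   // per-cell work

    class CellTask : public tbb::task {
      const int i, j;
    public:
      CellTask(int i_, int j_) : i{i_}, j{j_} {}
      tbb::task* execute() override {
        A[i*n + j] = foo(gs, A[(i-1)*n + j], A[i*n + j - 1]);   // task body
        if (j + 1 < n && --counters[i*n + j + 1] == 0)          // east ready?
          spawn(*new(allocate_additional_child_of(*parent()))
                    CellTask{i, j + 1});
        if (i + 1 < n && --counters[(i+1)*n + j] == 0)          // south ready?
          spawn(*new(allocate_additional_child_of(*parent()))
                    CellTask{i + 1, j});
        return nullptr;
      }
    };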
Note that we use allocate_additional_child_of(∗parent()) as the allocation method for the new tasks. By using this allocation method, we can add children while others are running. On the positive side, this allows us to save some coding that would have been necessary to ensure that all child tasks are allocated before any of them is spawned (since whether the east task, the south task, or both are ready to be dispatched depends on the execution). On the negative side, this allocation method requires that the parent’s ref_count be atomically updated (incremented when one “additional_child” is allocated and decremented when any child dies). Since we are using allocate_additional_child_of(∗parent()), all created tasks will be children of the same parent. The first task of the task space is task (1,1), and when it is spawned with spawn_root_and_wait(), a dummy successor is created (as explained earlier); all subsequently created tasks become additional children updating the ref_count of the dummy task.
A caveat of the allocate_additional_child_of allocation method is that the user (we) has to ensure that the parent’s ref_count does not prematurely reach 0 before the additional child is allocated. Our code already accounts for this eventuality, since a task, t, allocating an additional child, c, is already guaranteeing that t’s parent’s ref_count is at least one, because t will only decrement its parent’s ref_count when dying (i.e., after allocating c).
In Chapter 2, the parallel_do_feeder template was already presented to illustrate a different wavefront application: the forward substitution. This template essentially implements a work-list algorithm, in such a way that new tasks can be added dynamically to the work-list by invoking the parallel_do_feeder::add() member function. We call wavefront_v2_feeder the version of the wavefront code that relies on parallel_do_feeder and, as in Figure 2-19 of Chapter 2, uses feeder.add() instead of the spawn calls of Figure 10-19.
To avoid all tasks hanging from the same parent and atomically updating its single ref_count, we can implement a more elaborate version that mimics the blocking style explained earlier. Figure 10-20 shows the execute() member function in this case, where we first note whether the east cell, the south cell, or both are ready to dispatch and then allocate and dispatch the corresponding tasks. Note that now we use the allocate_child() method, and each task has at most two descendants to wait upon. Although the atomic update of a single ref_count is not a bottleneck anymore, more tasks are in flight waiting for their children to finish (and occupying memory). This version is named wavefront_v3_blockstyle.
Next, thanks to task recycling, we can save most of the task allocations and calls to spawn(), as well as the time for getting new tasks from the local queue. The resulting version is called wavefront_v4_recycle, and its main advantage is that it reduces the number of spawns from n x n − 2n (the number of spawns in the previous versions) to n − 2 (approximately the size of a column). See the companion source code to have a look at the complete implementation.
version that features this optimization. We set the flag recycle_into_east
if there is a ready to dispatch task to the east of the executing task. Otherwise, we set the flag recycle_into_south
, if the south task is ready to dispatch. Later, according to these flags, we recycle the current task into the east or south tasks. Note that, since in this example the data structure is stored by rows, if both east and south tasks are ready, the data cache can be better exploited by recycling into the east task. That way, the same thread/core executing the current task is going to take care of the task traversing the neighbor data, so we make the most out of spatial locality. So, in that case, we recycle into the east task and spawn a new south task that would be executed later.
A final tweak is to avoid passing around the data shared by all tasks, by storing it (n, gs, A, and counters) in global variables. This alternative is implemented in wavefront_v6_global, and it is provided in the directory with the source code of this chapter’s examples.
Experimenting with the parameter gs, which sets the number of floating-point operations per task, we found that for coarse-grained tasks executing more than 2000 floating-point operations (FLOPs), there is not much difference between the seven versions, and the codes scale almost linearly. This is because the parallel overheads vanish in comparison with the large enough time needed to compute all the tasks. However, it is difficult to find real wavefront codes with such coarse-grained tasks. In Figure 10-22, we show the speedup achieved by versions 0 to 5 on a quad-core processor, more precisely, a Core i7-6700HQ (Skylake architecture, 6th generation) at 2.6 GHz, with 6 MB L3 cache and 16 GB RAM. The grain size, gs, is set to only 200 FLOPs, and n=1024 (for this n, version 6 performs as version 5).
Summary
In this chapter, we walked from the most abstract to the most detailed way of expressing a parallel recursive algorithm, starting with the high-level parallel_invoke. We then started diving into deeper waters by using the medium-level API provided by the task_group class. It is, though, the task interface that offers the largest degree of flexibility to cater to our specific optimization needs. TBB tasks underpin the other high-level templates that were presented in the first part of the book, but we can also get our hands on them to build our own patterns and algorithms, leveraging the advanced techniques of continuation passing, scheduler bypassing, and task recycling. For even more demanding developers, more possibilities are available thanks to the task priority, task affinity, and task enqueue features that we will cover in the next chapter. We can’t wait to see what you can create and develop out of these powerful tools that are now in your hands.

For More Information
- A. Dios, R. Asenjo, A. Navarro, F. Corbera, E.L. Zapata, “A case study of the task-based parallel wavefront pattern,” Advances in Parallel Computing: Applications, Tools and Techniques on the Road to Exascale Computing, ISBN: 978-1-61499-040-6, Vol. 22, pp. 65–72, IOS Press BV, Amsterdam, 2012 (extended version available at www.ac.uma.es/~compilacion/publicaciones/UMA-DAC-11-02.pdf).
- A. Dios, R. Asenjo, A. Navarro, F. Corbera, E.L. Zapata, “High-level template for the task-based parallel wavefront pattern,” IEEE Intl. Conf. on High Performance Computing (HiPC 2011), Bengaluru (Bangalore), India, December 18–21, 2011. Implements a high-level template on top of TBB tasks to ease the implementation of wavefront algorithms.
- C. H. González Vázquez, “Library-based solutions for algorithms with complex patterns of parallelism,” PhD thesis, 2015. http://hdl.handle.net/2183/14385. Describes three complex parallel patterns and addresses them by implementing high-level templates on top of TBB tasks.
- Intel TBB Design Patterns:
  - GUI thread: http://software.intel.com/en-us/node/506119
  - Priorities: http://software.intel.com/en-us/node/506120
  - Wavefront: http://software.intel.com/en-us/node/506110