Open Access 2019 | Book Chapter

19. Flow Graphs on Steroids: OpenCL Nodes

By: Michael Voss, Rafael Asenjo, James Reinders

Published in: Pro TBB

Publisher: Apress
DOI: https://doi.org/10.1007/978-1-4842-4398-5_19


Abstract

Does the async_node leave you yearning for more? If so, this is your chapter. Here, we will cover a high-level Flow Graph class, the opencl_node, that strives to hide the hardware details and programming nuances of OpenCL-capable devices. Why OpenCL? Well, there are many reasons; to name a few: OpenCL is an open standard contributed to by members of a large consortium, it is designed to be a platform-independent API, and it aims to nimbly evolve to meet newer requirements. For instance, OpenCL has been an extension of C (not C++), but the latest OpenCL 2.2 version adds support for a subset of C++14, including classes, lambda expressions, templates, and so on.
Is that not enough? Okay, one more. For us, the mother of all reasons, the one standing out among the others, is the number and variety of platforms on which you can use OpenCL. Starting with the notebook and desktop segment, more than 95% of shipped processors for these systems include an OpenCL-capable integrated GPU (usually from Intel or AMD). In the mobile segment, at the heart of most smartphones and tablets, we find a System-on-Chip, SoC, featuring a GPU that supports OpenCL (and yes, from the TBB repository we can get the TBB binaries for Android too). These examples seem to be convincing enough, but there is more! In the embedded arena, for many years running, we have been able to buy and exploit heterogeneous boards that include an OpenCL-programmable FPGA (from Intel-Altera and Xilinx). In the server domain, at the time of writing these lines, Intel is targeting data centers with both FPGA PCIe cards and the Intel Xeon Scalable Processor 6138P, which includes an on-chip Intel-Altera Arria 10 FPGA, and of course, OpenCL is one of the supported programming models. Moreover, OpenCL code can also run on many CPUs and other kinds of accelerators, like the Xeon Phi.
But if OpenCL does not suit your needs, the TBB architects also considered the possibility of supporting other programming models. They abstracted away the low-level details of the accelerator programming model into a module called a factory. In fact, the opencl_node is the result of instantiating a general class called streaming_node with a particular factory. A factory defines the necessary methods to upload/download data to/from the accelerator and to launch the kernel. That is, the opencl_node is the result of marrying the streaming_node class to an OpenCL factory. Newer programming models can be supported just by developing the corresponding factory.
Now, this is quite a long chapter that covers several concepts (opencl_node, opencl_program, opencl_device, arguments and range of the OpenCL kernel, sub-buffers, etc.), therefore implying a steep learning curve. But we will start easy, on the plains, and progressively ascend to the more complex classes and examples (as we always try to do). As we depict in Figure 19-1, we will start with a simple “Hello World”-like example that uses the opencl_node, and later implement the same triad vector computation that was presented in the previous chapter, but now with our new high-level toy. If you want to skip the final climb to the summit, you can stop reading there. If on the other hand you are an experienced climber, at the end of the chapter we give a sneak peek at more advanced features, like fine-tuning the OpenCL NDRange and kernel specification.

Hello OpenCL_Node Example

This time let’s start at the end. This is the output of our first example:

                  Hello OpenCL_Node
                  Bye! Received from: OPENCL_NODE
                
These two lines are the result of running the flow graph that is shown in Figure 19-2, where the bubbles identify the string that is printed by each one of the three nodes in the graph.
The middle node, gpu_node, is an opencl_node that prints out “OpenCL_Node\n”. To this end, it will be configured to run the following OpenCL kernel, which is stored in a hello.cl file:
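A sketch of that kernel, reconstructed from the description below (not the book's verbatim listing): it prints the incoming string and then capitalizes the lowercase letters in place.

    kernel void cl_print(global char *str) {
      // print the incoming string character by character
      // (OpenCL's printf only guarantees %s for literal strings)
      for (int i = 0; str[i] != '\0'; ++i)
        printf("%c", str[i]);
      // capitalize only the lowercase letters
      for (int i = 0; str[i] != '\0'; ++i)
        if (str[i] >= 'a' && str[i] <= 'z')
          str[i] -= 'a' - 'A';
    }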
The hello.cl file includes the definition of the cl_print() kernel that will be executed by a special node of the flow graph, an opencl_node. If we take a closer look at the kernel function, it actually prints whatever array of chars happens to arrive as an input argument. In addition, to exert a noticeable impact, the kernel also changes the string by capitalizing only the lowercase letters. The global keyword preceding the char *str declaration of the argument states that the array of chars should be stored in OpenCL global memory. For what matters here (i.e., oversimplifying), this means that the string is stored in a region of memory that can be “somehow” read and written by the CPU and the GPU. In the common case of an integrated GPU, the global memory just sits on main memory. This means that the opencl_node should receive an array of chars as its argument. In our example, this array of chars contains the characters that read “OpenCL_Node\n”. As you are probably guessing, this message comes from the first node, in_node. Right, the pointer to the string (a in Figure 19-2) flies from in_node to gpu_node, and without user intervention, the string initialized on the CPU ends up reaching the GPU. And what message reaches out_node? Again, the pointer a, which leaves gpu_node and enters out_node with the name m. Finally, this last node of the graph prints “Bye! Received from: OPENCL_NODE” where we note the change in the string, and also that the string processed on the GPU has somehow become accessible by the CPU. Now, we all crave the details of the actual implementation, so here they are in Figure 19-3.
And that’s it! Note that the GPU node configuration requires only three C++ lines of code. Isn’t it neat?
Disclaimer
As of writing this chapter, the latest version of TBB is 2019. In this version, the opencl_node is still a preview feature, which essentially means that
  • It is subject to change. If you rely on a preview feature in your code, double check it keeps working when updating to a newer TBB release. In the worst case, a preview feature can even disappear!
  • It may have little documentation and support. Indeed, opencl_node and streaming_node documentation is not abundant on the Web. There are some blog entries1 illustrating this feature, but they are 3 years old and part of the API has changed since then.
  • It has to be explicitly enabled (i.e., by default it is off). To use opencl_node in our codes we have to add these three lines:
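To the best of our knowledge, for TBB 2019 they are the two preview macros plus the include of the opencl_node header:

    #define TBB_PREVIEW_FLOW_GRAPH_NODES 1
    #define TBB_PREVIEW_FLOW_GRAPH_FEATURES 1
    #include "tbb/flow_graph_opencl_node.h"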
The fringe benefit of using this header is that you don’t need to manually include tbb/flow_graph.h or the OpenCL header files, because they are already included within flow_graph_opencl_node.h. Actually, this header file, along with the blog entries, is nowadays our most reliable source of information about the classes and member functions that this feature provides. This chapter should be considered as a gentle introduction to the roughly one and a half thousand lines of code included in that opencl_node header.
Okay, let’s go node by node. The first one, in_node, looks familiar if you remember the examples from the previous chapter. To refresh our minds, suffice to say that (1) the input argument of the lambda ([&](buffer_t& a)) is actually a reference to the message that will be sent to any connected node; (2) only one message leaves in_node because after the first invocation it returns false; and (3) in_node.activate() is in fact awaking the node and triggering that single message. But wait, there is something new in this node that we do have to pay attention to! The message leaving in_node has to end up in a GPU-accessible region of memory and this is why argument a is not just an array of chars, but a reference to a buffer_t. Just before the definition of in_node we see that buffer_t is an opencl_buffer of OpenCL chars (cl_char):
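(a sketch of that alias)

    using buffer_t = tbb::flow::opencl_buffer<cl_char>;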
The opencl_buffer is the first opencl_node helper class that we will see in this chapter, but there are more to come. It is a template class that abstracts a strongly typed linear array and it encapsulates the logic of the memory transactions between the host and the accelerator. We allocate an opencl_buffer<T> using the constructor of the class, like in our example in the line a = buffer_t{sizeof(str)}, or by declaring a new object with
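(a sketch, with a hypothetical size N)

    buffer_t b{N};   // an opencl_buffer holding N cl_chars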
In both cases, we end up allocating an opencl_buffer of cl_char. The version of the OpenCL factory that we use now is based on OpenCL 1.2 and leverages the zero-copy buffer approach. This means that, internally, when calling the opencl_buffer constructor, the OpenCL function clCreateBuffer is called, and one of its arguments is CL_MEM_ALLOC_HOST_PTR. As we succinctly explained in the previous chapter, the buffer is allocated in the GPU space, but a CPU-accessible pointer (the CPU view of the buffer) can be obtained using a map function (clEnqueueMapBuffer). To return the control of the buffer to the GPU, OpenCL provides an unmap function (clEnqueueUnmapMemObject). On modern chips with integrated GPUs, map and unmap functions are cheap because no actual data copies are required. For these cases, map and unmap functions take care of keeping CPU and GPU caches consistent with the copy stored in the global memory (main memory), which may or may not imply CPU/GPU cache flushes. The good news is that all these low-level chores are none of our business anymore! Newer factories with better features or supporting other accelerators could be developed, and we could use them by simply recompiling our sources. Consider what would happen if an OpenCL 2.0 factory were released tomorrow and our accelerator implemented fine-grained buffer SVM. Just by using the new OpenCL 2.0 factory instead of the 1.2 one, we would get a performance boost for free (because map and unmap operations would then be unnecessary and cache coherency between the CPU and GPU would be kept automatically by the hardware).
Oops, sorry for letting our minds drift for a while. Let’s get back to the point. We were explaining the source_node of our example in Figure 19-3 (yes, several paragraphs ago). This source_node, in_node, just initializes an array of chars, str, with the string “OpenCL_Node\n”, allocates the opencl_buffer, a, of the appropriate size, and copies the string to that buffer using the std::copy_n STL algorithm. That’s it. When the lambda of this source_node finishes, a message with a reference to the opencl_buffer will fly from in_node to the gpu_node.
Now, remember the lines required to configure the gpu_node:
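A sketch of those three lines, consistent with the explanation that follows:

    tbb::flow::opencl_program<> program{"hello.cl"};
    tbb::flow::opencl_node<std::tuple<buffer_t>> gpu_node{g, program.get_kernel("cl_print")};
    gpu_node.set_range({{1}});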
The first line uses the second opencl_node helper class that we cover in this chapter: the opencl_program class. In this line, we create the program object passing to the constructor the name of the file, hello.cl, where the OpenCL kernel, cl_print, is stored. There are other opencl_program constructors available, should we want to provide a precompiled kernel or the SPIR (OpenCL intermediate representation) version of the kernel. For the sake of keeping drifting minds away and staying focused on our example, we cover these other alternatives later.
The second line creates the gpu_node of type opencl_node<tuple<buffer_t>>. This means that the gpu_node receives a message of type buffer_t and, when done, it emits a message also of type buffer_t. Do we really need a tuple for a single argument/port? Well, the opencl_node is designed to receive several messages from preceding nodes and to send several messages to the following nodes of the graph, and these are packed into a tuple. Currently, there is no special case in the interface for a single input and output, so we need to use a single-element tuple in that case. Regarding the correspondence between the opencl_node ports and the kernel arguments, by default, opencl_node binds the first input port to the first kernel argument, the second input port to the second kernel argument, and so on. There are other possibilities that will be covered later.
And do we really need to send an outgoing message for every incoming one? Well, the opencl_node is designed to support this maximum connectivity (one output port per input port), and if there are fewer inputs than outputs, or the other way around, we can always leave the corresponding ports unconnected. And do we really need to use the same data type for the input and the output? Well, with the current factory, yes. If the input port 0 is of type T, the output port 0 is of the same type T (the tuple specifying the argument types does not distinguish between input and output).
Note
The main reason supporting the opencl_node implementation decisions is that each opencl_node’s port can potentially be mapped into each OpenCL kernel argument. For an “in-out” argument, having it at both input and output of course makes sense. For an “out” argument, we still need to pass in the object that is to be written, so there is a need for an input to match the output – otherwise the opencl_node would need to allocate the objects, which it doesn’t. And finally, for an “in” argument, having it available at the output lets us forward the value, that is, pass it through unchanged to downstream nodes. So, the most practical thing was to just make all arguments in-out. We believe it makes sense if we think of the OpenCL node’s tuple as a list of arguments, and we can connect edges to any of the arguments to set/get the value before/after the execution. For an “in” argument, the corresponding emitted value is unchanged. For an “out” argument, we provide the memory to write to and later get the value. And for “in-out,” we send the value and receive the modified value.
Remember that OpenCL node is a preview feature. The TBB developers are eager for input on preview features – that’s why they’re preview features after all.  They want to collect input on what’s good and what’s bad, so they can spend time on perfecting the parts of the library that people care the most about. This preview OpenCL node is supposed to be good enough to try out and provide feedback.  If we have strong opinions on what needs to be added – we should speak up!
Now, the constructor of the opencl_node includes as arguments the flow graph object, g, and a handle to the kernel function that should be included in the OpenCL program file. Since the file hello.cl includes the kernel function cl_print, we use the member function: program.get_kernel("cl_print").
This means that we can have several kernel functions in the same OpenCL source file and assign each one to a different opencl_node. And do we really have to settle for just a single program file? Well, not quite. We can instantiate the desired number of opencl_program objects if we have our OpenCL kernels distributed among several source files.
Finally, the third line of code needed to configure the gpu_node is gpu_node.set_range({{1}}). This member function from opencl_node specifies the iteration space that will be traversed by the GPU. More formally, in the OpenCL jargon, this iteration space is known as the NDRange, but let’s not dwell on these details at this point. For now, let’s take a leap of faith and just believe that the set_range({{1}}) member function results in the body of the kernel being executed just once.
Now we are done with the source_node (in_node), the opencl_node (gpu_node), and the last one in our example is a regular function_node called out_node. The corresponding code is
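sketched here (the concurrency setting is our assumption):

    tbb::flow::function_node<buffer_t> out_node{g, tbb::flow::unlimited,
      [](buffer_t m) {
        char *str = (char*) m.begin();   // CPU-visible view of the buffer
        std::printf("Bye! Received from: %s", str);
      }};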
We see that out_node receives a message, m, of type buffer_t. Because buffer_t is really an opencl_buffer<cl_char>, the call m.begin() results in a CPU visible pointer to the string that was initially set in in_node and was later modified by the GPU kernel. Our last node just prints this string and dies.
The rest of the example is the usual flow graph glue logic that makes the edges between the nodes, wakes up the source node, and waits for all the messages (just one in our example) to pass through the nodes. Nothing new here.
However, before we start climbing the first hills of our ascent, we will do a high-level recap of what we just explained, while going deeper into what happens with the message, a, that was born on the CPU, sent to the GPU and modified there, to later pass to the final node where we can see the effect of the GPU kernel execution. We hope Figure 19-4 will serve us well in this regard.
The picture assumes that the OpenCL factory is based on the 1.2 version of this standard. In this case, the message a is allocated, as an opencl_buffer, in the GPU memory space but can also be written on the CPU if we first get the CPU-accessible iterator with a.begin(). The reference to a is a message that leaves in_node and enters port 0 of the gpu_node (which will invariably cause the message, the reference to a, to also leave through output port 0). Port 0 of gpu_node is bound to the first argument of the kernel function, which has a compatible type (opencl_buffer<cl_char> can be cast to char *). The kernel can safely access the string without cache coherency issues because, before launching the kernel, the OpenCL factory takes care of unmapping the buffer. Finally, the reference to the buffer reaches out_node, where the string is again mapped to be accessed and printed on the CPU.
Before moving on, we would like to underscore here how fortunate we should feel by not having to manually deal with all the OpenCL boilerplate code (platform, devices, context, command queues, kernel reading and compilation, kernel argument setting and launching, OpenCL resources deallocation, etc.). All that is now hidden under the hood thanks to the OpenCL factory. Besides, as we said, new factories can make our code faster or able to work with other accelerators with minor or no changes in the source code.

Where Are We Running Our Kernel?

So far so good, right? But speaking of OpenCL boilerplate code, where is the knob to control on which device we are running our opencl_nodes? In our previous example, we said that the gpu_node was running the specified kernel on the GPU. Where else, right? But what if we were lying? Disturbing, huh? Okay, let’s see first if there are more OpenCL capable devices on our machine. Hopefully there is only a single device and it is a GPU, but I wouldn’t bet my fingers on it! We’ll have to sniff it out, but we are not emotionally ready to write old-style plain OpenCL code, are we? Mercifully, the TBB OpenCL factory puts in our hands two additional and valuable helper classes (and that makes four of them already). These are the opencl_device and opencl_device_list helper classes. Let’s use them first outside the flow graph context, as we can see in Figure 19-5.
First, an opencl_device_list object, devices, is initialized by calling the function available_devices(). This function returns an iterable container with all the OpenCL enabled devices available in the first platform. Yes, only in the first available platform.2 Then, we pop the first opencl_device, d, from the list and query the platform name, profile, version, and vendor. All available devices in the platform will share these attributes.
Next, with for(opencl_device d:devices), we iterate through the whole device list to get and print the per-device name, major and minor version, and device type. Major and minor version information was already provided by d.platform_version(), but that one returns a string, whereas both d.major_version() and d.minor_version() return an integer. The output resulting from running this code on the MacBook on which we are writing these lines, and where we have run our previous example, can be seen in Figure 19-6.
Note
The function available_devices() is in fact not public and that is the reason we had to use this convoluted chain of namespaces:
tbb::flow::interface10::opencl_info::available_devices()
We noticed that just before the implementation of this member function inside flow_graph_opencl_node.h there is a comment stating that
// TODO: consider opencl_info namespace as public API
Since this is a preview feature in TBB, the interface is not yet completely settled. Take this into account in case this consideration eventually becomes a fact.
It may come as some surprise that there may be three OpenCL devices in a laptop! Namely, one Intel CPU and two GPUs, the first one integrated in the Intel Core i7, and the second one a discrete AMD GPU. Remember that OpenCL is a portable programming language that can also be used to implement CPU code. And look, the first OpenCL-capable device is not a GPU, but the quad-core Intel CPU. Now, regarding our first example in this chapter, where was the kernel running? You are right, on the first one. The OpenCL factory chooses by default the first available device, irrespective of whether it’s a CPU or a GPU. So… we were lying!!! The kernel was running on a CPU disguised as an OpenCL accelerator. What if we have been lying here and there throughout the book? Think about it… that’s even more terrifying (unless this is the first chapter you are reading).
Okay, let’s fix this minor inconvenience. To save the day, the OpenCL factory comes with two additional features: Device Filter and Device Selector. Device Filters are used to initialize the opencl_factory with the set of devices that are available for kernel execution. All filtered devices must belong to the same OpenCL platform. There is a default device filter class, default_device_filter, that automatically gathers all available devices from the first OpenCL platform and returns an opencl_device_list containing these devices. For its part, a Device Selector, as its name suggests, selects one of the devices in that opencl_device_list. It is possible to use different device selectors for different opencl_node instances. The selection is done for every kernel execution, so it is also possible to have an opencl_node running on different devices for different invocations. The default selector, default_device_selector, lazily selects and returns the first device from the list of available devices that was constructed by the device filter.
To get our gpu_node running on a real GPU, instead of
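(the constructor used so far, sketched)

    tbb::flow::opencl_node<std::tuple<buffer_t>> gpu_node{g, program.get_kernel("cl_print")};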
we should use
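(the same constructor, sketched, with the device selector as a third argument)

    tbb::flow::opencl_node<std::tuple<buffer_t>> gpu_node{g, program.get_kernel("cl_print"), gpu_selector};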
where gpu_selector is an object of our custom class gpu_device_selector:

                  gpu_device_selector gpu_selector;
                
and this class is presented in Figure 19-7.
The agreement (more formally, the “Concept”) is that the third argument of an opencl_node is a functor (an object of a class with the operator() member function) that returns a device. That way, instead of passing the functor we may embed a lambda expression in its place. The operator() receives an opencl_factory, f, and returns an opencl_device. Using the find_if STL algorithm, we return the first iterator, it, in the container devices() that fulfills it->type()==CL_DEVICE_TYPE_GPU. For the sake of expediency, we declared auto it and delegated to the compiler to find out that the type of it is actually

                  tbb::flow::opencl_device_list::const_iterator it = ...
                
To account for the possibility of not finding a GPU device, we include a fallback that returns the first device (there should be at least one! ... there is no point in having a platform without any device). The functor finishes by printing the name of the selected device and returning it. In our laptop, the output would be:
Note that the new messages are printed out by the gpu_node device selector functor when this node is activated. That is, first in_node prints its message “Hello” and passes the message to the gpu_node, which, before launching the kernel, selects the device (printing the boldface words of the output) and then runs the kernel. That’s something to consider: an opencl_node in a flow graph is usually activated several times, so we are better off implementing the lightest possible device selector.
For example, if the lambda expression of the std::find_if algorithm doesn’t need to print the “Found GPU!” message, it can be further simplified:
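A possible simplified form (a sketch that drops the print statement):

    auto it = std::find_if(f.devices().cbegin(), f.devices().cend(),
                           [](const tbb::flow::opencl_device &d) {
                             return d.type() == CL_DEVICE_TYPE_GPU;
                           });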
Now, if we don’t like how our source code looks having to explicitly add the gpu_device_selector class, we can replace the functor with a lambda expression. It is kind of tricky because the operator() of this class is a templated function, remember?:
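Roughly (a sketch of the signature, with a trivial body):

    template <typename DeviceFilter>
    tbb::flow::opencl_device operator()(tbb::flow::opencl_factory<DeviceFilter> &f) {
      // search f.devices() for a GPU here; fall back to the first device
      return *f.devices().cbegin();
    }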
The easiest way (that we are aware of) to come up with the lambda implementation is to rely on polymorphic lambdas, which are available since C++14. Don’t forget to compile the code in Figure 19-8 with the option -std=c++14.
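A sketch of what the Figure 19-8 code can look like (our reconstruction; the choice of the second device matches the output shown next):

    tbb::flow::opencl_node<std::tuple<buffer_t>> gpu_node{g,
      program.get_kernel("cl_print"),
      [](auto &f) -> tbb::flow::opencl_device {
        std::cout << "Available devices:\n";
        int i = 0;
        for (tbb::flow::opencl_device d : f.devices())
          std::cout << i++ << ".- Device: " << d.name() << '\n';
        tbb::flow::opencl_device d = *(++f.devices().cbegin()); // pick the second device
        std::cout << "Running on " << d.name() << '\n';
        return d;
      }};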
Note the (auto& f) argument of the lambda, instead of (opencl_factory<DeviceFilter>& f) that we used in the functor-based alternative. This code traverses the devices() container and then returns the second device in the list, resulting in something like

                  Available devices:
                  0.- Device: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
                  1.- Device: Intel(R) HD Graphics 530
                  2.- Device: AMD Radeon Pro 450 Compute Engine
                  Running on Intel(R) HD Graphics 530
                
Now that we know our device list and assuming we want to use the integrated GPU, better change the lambda to make it faster:
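(a sketch; it hardcodes the second device of the list, as discussed below)

    [](auto &f) -> tbb::flow::opencl_device {
      // directly return the second device (the integrated GPU on our laptop)
      return *(++f.devices().cbegin());
    }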
An even faster alternative would be to cache the opencl_device the first time we invoke the device selector. For example, in Figure 19-9, we sketch a modification of the gpu_device_selector class that was presented in Figure 19-7.
The class has now an opencl_device member variable, device. When the operator() is invoked the first time, the device list, f.devices(), is traversed to find the device that we want to use (in the example, the second available device). Then we cache it into the device variable for future uses. Note that further care to avoid data races is required if this operator can be called concurrently from different threads.
We hope you can keep secret how badly we coded the examples of Figures 19-8 and 19-9. In those snippets, we are hardcoding the device to be the second one, which works on our test machine, but it may miserably fail on other platforms. Actually, if there is a single device stored in the f.devices() container, dereferencing *(++f.devices().cbegin()) will trigger a segmentation fault. This is another example of the trade-off between portability and performance. We would be better off using the version of Figure 19-7 (commenting out the print statements) if we don’t know where the code can eventually run and the device selection time is negligible in comparison with the OpenCL computation.

Back to the More Realistic Example of Chapter 18

Do you remember the triad vector operation that we introduced in the previous chapter? It was just a basic array operation of the form C = A + α*B where A, B, and C are 1D arrays containing vsize floats, and α is a scalar that we set to 0.5 (because we can). Figure 19-10 is a reminder of the way in which our triad computation will be distributed between the GPU and CPU depending on the variable offload_ratio.
The purpose of re-implementing this example is twofold. First, by re-visiting our old acquaintance but now from the perspective of the opencl_node, we will better appreciate the benefits of this higher-level feature of the TBB flow graph. Second, going beyond the “Hello OpenCL_Node” will allow us to delve into more advanced uses of the opencl_node class and its helper classes. In Figure 19-11, we give an overview of the flow graph that we are about to implement.
As in our previous examples, the source_node (in_node) just triggers the execution of the flow graph, in this case passing a single message with the value of offload_ratio. The next node downstream is a multifunction_node (dispatch_node). This kind of node stands out for its flexibility to send messages to the following nodes in the graph. We see that dispatch_node has five output ports, the first four targeting the gpu_node, and the last port connected to the cpu_node. The gpu_node is an opencl_node that will be configured with the appropriate triad GPU kernel, which expects as input arguments a “GPU view” of arrays A, B, and C (as in the previous chapter, they are called Adevice, Bdevice, and Cdevice). However, the gpu_node has an additional port to receive the number of iterations that will be offloaded, which depends on offload_ratio and which we call NDRange to adhere to the OpenCL notation. The cpu_node is a regular function node that receives the “CPU view” of the three arrays as well as offload_ratio so that the CPU can hold up its end of the bargain. The cpu_node has a single input port, so dispatch_node has to pack into a tuple the four variables required by the CPU. Both the gpu_node and the cpu_node pass their own view of the resulting array C to the join_node, which in turn builds a tuple with both views and forwards it to the out_node. This final node will validate that the computation is correct and print out the execution time. Without further ado, let’s start with the real implementation, kicking off with data type definitions and buffer allocations in Figure 19-12.
From now on, a buffer_f is an opencl_buffer of cl_floats (OpenCL counterpart of the regular float data type). With this, we allocate Adevice, Bdevice, and Cdevice as the “GPU views” of our three arrays. The opencl_buffer class also exposes the data() member function, which we see here for the first time. This function returns a CPU-accessible pointer to the GPU buffer and also takes care of mapping the buffer so that the CPU can have access to it. This allows us to initialize the pointers Ahost, Bhost, and Chost. Using the STL generate algorithm we initialize arrays A and B with random numbers between 0 and 255, using a Mersenne Twister generator (as we did in Chapter 5).
The first two nodes of the graph, in_node and dispatch_node, are defined in Figure 19-13.
This part of the algorithm is quite straightforward. Our old friend in_node sends offload_ratio=0.5 a single time to dispatch_node. For its part, dispatch_node is of the following type:
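Sketched with hypothetical aliases (NDRange_t and dispatch_t are our names; we assume the NDRange travels as a one-element container such as a std::array):

    using buffer_f   = tbb::flow::opencl_buffer<cl_float>;      // as in Figure 19-12
    using NDRange_t  = std::array<size_t, 1>;                   // 1D iteration space
    using tuple_cpu  = std::tuple<float*, float*, float*, size_t>;
    using dispatch_t = tbb::flow::multifunction_node<float,
            std::tuple<buffer_f, buffer_f, buffer_f, NDRange_t, tuple_cpu>>;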
which means that it receives a float (offload_ratio) and has five output ports that send messages that correspond to the five tuple element types. This tuple encapsulates the data type of the five output ports of this multifunction node: three buffer_f (opencl_buffers here) for the three arrays, the NDRange, and a tuple_cpu that packs all the information for the cpu_node.
The two input arguments of the lambda expression that defines the body of dispatch_node are
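sketched as (the usual multifunction_node body signature):

    [&](const float &offload_ratio, dispatch_t::output_ports_type &ports) {
      // fill the buffers and try_put() a message to each port, as described next
    }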
where we find the input message (offload_ratio) and a handle (ports) that give us access to each one of the five output ports. Now, we use function get<port_number>(ports).try_put(message) to send a message to the corresponding port_number. Four calls to this function are all we need to send the information that the GPU is waiting for. Note that the last one of these four calls puts a 1D array with just one element equal to ceil(vsize*offload_ratio) which corresponds to the iteration space on the GPU. A single message sets out on a trip to the CPU via the last port, using get<4>(ports).try_put(cpu_vectors). Previously, we have conveniently packed the CPU view of the three vectors and the vector partitioning information (ceil(vsize*offload_ratio)) in the cpu_vectors tuple.
Any questions? Sure? We don’t want to leave any reader behind. Okay then. Let’s move on to see the implementation of the next two nodes, the meat of the matter, where the real computation takes place, in Figure 19-14.
Although cpu_node is the second one in Figure 19-14, we will cover it first since it requires less clarification. The template parameter <tuple_cpu, float*> points out that the node receives a tuple_cpu and sends a pointer to float. The lambda input argument, cpu_vectors, is used in the body to unpack the pointers to the three vectors and the variable start (which gets the value ceil(vsize*offload_ratio) already computed on the dispatch_node). With this information, a parallel_for carries out the triad computation in the range blocked_range<size_t>(start, vsize), which corresponds to the second part of the iteration space.
As we said, the GPU is responsible for the first part of this iteration space, known in this context as the NDRange=[0, ceil(vsize*offload_ratio)). The source code of the GPU kernel is the same as we presented in the previous chapter, and it just receives the three arrays and does the triad operation for every i in NDRange:
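For reference, a sketch of that kernel (α still hardwired to 0.5, as in the previous chapter):

    __kernel void triad(__global float *A, __global float *B, __global float *C) {
      int i = get_global_id(0);
      C[i] = A[i] + 0.5f * B[i];
    }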
These kernel lines are inside the triad.cl file, hence the line:
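(sketched)

    tbb::flow::opencl_program<> program{"triad.cl"};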
at the beginning of Figure 19-14. The custom type tuple_gpu packs the three buffer_f and the NDRange. With this, we declare the gpu_node as
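sketched, reusing the aliases introduced above:

    using tuple_gpu = std::tuple<buffer_f, buffer_f, buffer_f, NDRange_t>;
    tbb::flow::opencl_node<tuple_gpu> gpu_node{g, program.get_kernel("triad"), gpu_selector};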
That selects the “triad” kernel of the program file and specifies our favorite Device Selector, gpu_selector.
Now comes an interesting configuration detail. Four messages reach the gpu_node and we mentioned previously that “opencl_node binds the first input port to the first kernel argument, the second input port to the second kernel argument and so on.” But wait! The kernel only has three arguments! Were we lying again!!?? Well, not this time. We also said that this is the default behavior and that it can be modified. Here is how.
With gpu_node.set_args(port_ref<0,2>) we state that the messages arriving at ports 0, 1, and 2 should be bound to the three input arguments of the kernel (A, B, and C). And what about the NDRange? In our first example, “Hello OpenCL_Node” in Figure 19-3, we just used gpu_node.set_range({{1}}) to specify the smallest possible NDRange with constant value 1. But in this second and more elaborate example, the NDRange is variable and comes from dispatch_node. We can bind port 3 of the node, which receives the NDRange, with the set_range() function, as we did with the line gpu_node.set_range(port_ref<3>). This means that we can pass to set_range() either a constant or a variable NDRange that comes through a port. The member function set_args() should support the same flexibility, right? We know how to bind kernel arguments to opencl_node ports, but oftentimes kernel arguments have to be set just once and not for every invocation.
Say for example that our kernel receives the value of α that is now a user-defined argument (not hardwired to 0.5 as before):
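A sketch:

    __kernel void triad(__global float *A, __global float *B, __global float *C, float alpha) {
      int i = get_global_id(0);
      C[i] = A[i] + alpha * B[i];
    }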
Then we can write the following: gpu_node.set_args(port_ref<0,2>, 0.5f), which binds the first three kernel arguments to the data reaching ports 0, 1, and 2, and the fourth argument to… 0.5 (oh no! hardwired again! More seriously, nothing prevents us from passing a variable, alpha, previously set to... 0.5).
Now, let’s go for the last two nodes, node_join and out_node, that are detailed in Figure 19-15.
As indicated in boldface, node_join receives a buffer_f (from gpu_node) and a pointer to float (from cpu_node). This node is created just to join these two messages into a tuple that is forwarded to the next node. Speaking of which, the next node is out_node, a function_node that receives a message of the type join_t::output_type and does not send any output message. Note that join_t is the type of node_join, so join_t::output_type is an alias of tuple<buffer_f, float*>. Actually, the input argument of the lambda, m, has this type. A convenient way to unpack the tuple m is to execute std::tie(Cdevice, Chost) = m, which is completely equivalent to

                  Cdevice = std::get<0>(m);
                  Chost = std::get<1>(m);
                
The next lines of the body of out_node check that the heterogeneous computation is correct, first serially computing a golden version, CGold, of the triad array operation and then comparing with Chost using the std::equal algorithm. Since Chost, Cdevice.data(), and Cdevice.begin() are all actually pointing to the same buffer, these three comparisons are equivalent:

                  std::equal (Chost, Chost+vsize, CGold.begin())
                  std::equal (Cdevice.begin(), Cdevice.end(), CGold.begin())
                  std::equal (Cdevice.data(), Cdevice.data()+vsize, CGold.begin())
                
Time to sign off our code. In Figure 19-16, we add the make_edge calls and trigger the execution of the flow graph.
Note that, although the four input ports of gpu_node are connected to the preceding dispatch_node, only port number 2 of gpu_node goes to node_join. This port carries the resulting Cdevice buffer, so it is the only one we care about. The other three disregarded ports won’t feel offended.
It took us a while to explain the whole example, and still we have to add one more thing. How does it compare with the async_node version that we presented in the previous chapter? Our async_node version included the OpenCL boilerplate that was hidden in the OpenCL_Initialize() function but was required because it gave us access to the context, command queue, and kernel handlers. This async_node version has 287 lines of code (excluding comments and blank lines) if we use the cl.h OpenCL header, or 193 lines using the cl.hpp C++ wrapper of the cl.h header. This new version based on the opencl_node feature further reduces the size of the source file to just 144 lines of code.

The Devil Is in the Details

Those of us who have previously developed OpenCL codes know that we can “enjoy” a considerable degree of latitude if we use the raw OpenCL library directly. This flexibility doesn’t show up in the opencl_node at first glance. How can we define a multidimensional NDRange? And how can we also specify the local size in addition to the NDRange global size? And how can we provide a precompiled kernel instead of the OpenCL source? Maybe the problem is that we have not covered all the available configuration knobs yet. Let’s jump into the answers to these questions.
The main OpenCL functions needed to launch a kernel are clSetKernelArg (clSetKernelArgSVMPointer if we use OpenCL 2.x Shared Virtual Memory pointers) and clEnqueueNDRangeKernel. These functions are internally called in the OpenCL factory, and we can control which arguments are passed into them. To illustrate how the opencl_node member functions and helper functions are translated to raw OpenCL calls, we zoom in on an opencl_node in Figure 19-17.
In this figure, we use the gpu_node of the previous triad example, where we configured an opencl_node to receive three opencl_buffers and the NDRange (a total of four ports that enter and leave the node). As we explained a few pages ago, thanks to gpu_node.set_args(port_ref<0,2>, alpha), we clearly state that the first three input ports (0, 1, and 2) that carry the A, B, and C vectors should be bound to the first three arguments of the kernel, and that the last argument of the kernel (the multiplicative factor α) is statically bound to a variable alpha, which does not come from previous nodes of the graph. Now, we have all the information that is required to make the four clSetKernelArg() calls that we see in Figure 19-17, which in turn work their magic to get these four arguments to appear as inputs to the kernel void triad(...) OpenCL function.
Now, let’s look at how the clEnqueueNDRangeKernel call is appropriately configured. This is one of the most complex OpenCL calls; it requires nine arguments that we list in Figure 19-18. However, this is not an OpenCL primer, and for this chapter it suffices to discuss just the five arguments from the 2nd to the 6th. The one identified with the variable “kernel” will be covered later, and to understand the other four we have to delve deeper into one of the fundamental concepts of OpenCL: the NDRange.

The NDRange Concept

An NDRange defines an iteration space of independent work items. This space is potentially three-dimensional, but it can also be 2D or 1D. The NDRange in our triad example is 1D. The argument dim in the clEnqueueNDRangeKernel call in Figures 19-17 and 19-18 should contain 1, 2, or 3 accordingly and will be properly set by the gpu_node.set_range() call. In the example of Figure 19-17, this set_range() call points out that the NDRange information arrives at port 3 of the gpu_node from a previous node of the graph. The NDRange information should be in one, or optionally two, containers that provide begin() and end() member functions. Many standard C++ types provide these member functions, including std::initializer_list, std::vector, std::array, and std::list, to name a few. If we only specify one container, the opencl_node just sets the global_work_size argument of the clEnqueueNDRangeKernel() function (identified with the variable global in Figures 19-17 and 19-18). If, on the other hand, we also specify a second container, the opencl_node sets the local_work_size argument (local in Figures 19-17 and 19-18) as well.
Note
As we said, the NDRange global_work_size defines the parallel iteration space that will be executed by the accelerator. Each point in this space is called a work item in OpenCL jargon (if you are familiar with CUDA, it is equivalent to a CUDA thread). Therefore, work items can be processed in parallel on the different accelerator compute units, CUs, and the corresponding computation is defined by the kernel code; that is, if our kernel function includes C[i]=A[i]+B[i], this is the expression that will be applied to each work item i of this 1D iteration space.
Now, work items are grouped into what are called work-groups (or blocks using CUDA notation). Due to architectural implementation details, work items belonging to the same work-group are more tightly related. For instance, on a GPU, it is guaranteed that a work-group will be scheduled on a single GPU compute unit. This implies that we can synchronize the work items of a single work-group with an OpenCL barrier, and these work items share a per-CU memory space called “local memory” which is faster than the global memory.
The argument local_work_size specifies the size of the work-groups. OpenCL drivers can automatically compute a recommended local_work_size if none is provided. However, if we want to enforce a particular work-group size, we have to set the local_work_size argument.
Here is where some examples will make it crystal clear. Say we have 2D arrays A, B, and C, of dimensions h x w and we want to compute the matrix operation C=A+B. Although matrices are two-dimensional, in OpenCL they are passed to the kernel as a pointer to a row-major linearized 1D cl_mem buffer. This does not prevent us from computing a 1D index from a 2D one, so the kernel will look like
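this sketch (the name matrix_add and the argument order A, B, C, w are our choices, kept consistent with the later discussion):

    __kernel void matrix_add(__global float *A, __global float *B,
                             __global float *C, int w) {
      int x = get_global_id(0);   // column
      int y = get_global_id(1);   // row
      C[y*w + x] = A[y*w + x] + B[y*w + x];
    }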
although the fancy way to express the same uses the int2 type and reads as

                      int2 gId = (int2)(get_global_id(0),   get_global_id(1));
                      C[gId.y*w+gId.x] = A[gId.y*w+gId.x] + B[gId.y*w+gId.x];
                  
To get more information about what is going on for each work item during the kernel execution, we will print out some additional information, as seen in Figure 19-19.
The first three variables, gId, lId, and grId store the global ID, local ID, and group ID of each work item, respectively, in both dimensions x and y. The next three variables gSize, lSize, and numGrp are set to the global size, local size, and number of work-groups. The first if condition is satisfied only by the work item with global ID (0,0). So only that work item prints out the various sizes and number of groups, which are the same for all work items. The second printf statement is executed by every work item and prints the global, local and group IDs for that work item. This results in the output shown in Figure 19-20, when enqueued with dim = 2, global = {4,4} and local = {2,2}.
In this figure, we depict every work item with a colored box. There are 16 work items arranged in a 4×4 grid where we identify each work-group using four different colors. Since the local size is {2,2}, each work-group is a 2×2 subspace. It’s no wonder that the number of groups is 4, but to give some formalism to this chapter we add here some invariants that we can easily demonstrate:

                    numGrp.x = gSize.x/lSize.x
                    0 <= gId.x < gSize.x
                    0 <= lId.x < lSize.x
                    gId.x = grId.x * lSize.x + lId.x
                  
and likewise, for the .y coordinate (or even .z in a 3D space)
And now, how do we specify the global and local sizes for an opencl_node? So far, we have just used gpu_node.set_range({{<num>}}) in previous examples of the chapter. This translates into dim=1, global={<num>} and local=NULL, which results in a 1D NDRange with the local size left to the OpenCL driver’s discretion.
In the general case, we may need global={gx, gy, gz} and local={lx, ly, lz}. The easiest way to achieve this is to use

                        gpu_node.set_range({{gx, gy, gz},{lx, ly, lz}});
                  
However, as we said, any container that can be iterated with a begin() member function will also suit our needs. For instance, a more convoluted way of expressing the same would be
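something like this sketch (any iterable containers with begin()/end() would do):

    std::vector<size_t> global = {gx, gy, gz};
    std::vector<size_t> local  = {lx, ly, lz};
    gpu_node.set_range({global, local});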
The resulting range has as many dimensions as the number of elements in the container, with each dimension size set to the corresponding element value. The caveat here is to specify the same dimensionality for both the global and the local containers.
To make things interesting, we have to add the TBB driver code that can launch the kernel of Figure 19-19. The most concise way we know of is to build a graph with a single opencl_node as shown in Figure 19-21.
See? Just a handful of lines of code and we are up and running an OpenCL code that adds two matrices A and B. Note that the opencl_node, gpu_node, has only a single port, port<0>, that is bound to the third argument of the kernel, matrix C, that carries the result of the computation conducted in the kernel. The input matrices, A and B, and the matrix width, w, are directly passed using set_args member function. Also note that the opencl_node has to have at least one port and it is activated only if a message lands in this entry port. An alternative to implement the gpu_node would be the following:
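A sketch of this alternative (our reconstruction; the kernel name, the local size, and the way the two messages are injected are assumptions):

    tbb::flow::opencl_node<std::tuple<buffer_f, tbb::flow::opencl_range>>
        gpu_node{g, program.get_kernel("matrix_add")};
    gpu_node.set_args(Adevice, Bdevice, tbb::flow::port_ref<0>, w); // C arrives through port 0
    gpu_node.set_range(tbb::flow::port_ref<1>);                     // NDRange arrives through port 1
    // ... and, to trigger the node:
    tbb::flow::input_port<1>(gpu_node).try_put({{cols, rows}, {4, 4}});
    tbb::flow::input_port<0>(gpu_node).try_put(Cdevice);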
where the gpu_node receives Cdevice on port<0> and the NDRange on port<1>, and the remaining kernel arguments are specified with the set_args() member function. The type of the message arriving at and leaving port<1> of the gpu_node is tbb::flow::opencl_range (the umpteenth opencl_node helper class, so far!), and we rely on try_put() to pass an opencl_range object initialized with the two containers.

Playing with the Offset

There are two other arguments of the clEnqueueNDRangeKernel function that we left behind (see Figure 19-18). One is the offset argument that can be used to skip some of the first work items at the beginning of the iteration space. In the current implementation of the OpenCL factory, this offset is hardwired to {0,0,0}. Not a big deal. There are two possible workarounds to overcome this limitation.
The first one consists of passing the offset to the kernel and adding it to the global ID before indexing the arrays. For example, for a one-dimensional C=A+B operation we can write something like
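this sketch (with a hypothetical kernel name):

    __kernel void vadd_offset(__global float *A, __global float *B,
                              __global float *C, int offset) {
      int i = get_global_id(0) + offset;   // shift the global ID by the offset
      C[i] = A[i] + B[i];
    }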
And of course, we can adapt the NDRange to avoid overrunning the arrays. Although functional, this is not a super-elegant solution. What is the super-elegant solution then? Well, we can use the opencl_subbuffer class to achieve the same result. For example, if we want to add just a subregion of vectors A and B, we can keep a simpler version of the vector-add kernel:
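(a sketch)

    __kernel void vadd(__global float *A, __global float *B, __global float *C) {
      int i = get_global_id(0);
      C[i] = A[i] + B[i];
    }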
but pass the following arguments to the set_args() member function:

                      Adevice.subbuffer(offset, size)
                  
and similarly, for Bdevice and Cdevice. Another alternative to create a sub-buffer of Cdevice is to call

                      tbb::flow::opencl_subbuffer<cl_float>(Cdevice, offset, size)
                  

Specifying the OpenCL Kernel

Finally, we have to devote some time to the kernel argument (see Figure 19-18). Up to now, we have used OpenCL source files to provide our kernel. In the last example of Figure 19-21, we used again the opencl_program class:
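(sketched; the source file name is hypothetical)

    tbb::flow::opencl_program<> program{"fig_19_21.cl"};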
which is equivalent to the more explicit constructor:
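(a sketch, assuming the SOURCE program type)

    tbb::flow::opencl_program<> program{tbb::flow::opencl_program_type::SOURCE, "fig_19_21.cl"};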
This is the usual approach to provide the kernel function that, on the one hand, requires compiling the source at runtime, but on the other hand, provides portability because the source will be compiled (only once at opencl_program construction) for all the available devices. Internally, the OpenCL factory relies on the OpenCL functions clCreateProgramWithSource and clBuildProgram.
If we are positive we don’t need to port our code to any other platform and/or if, for the production version, we require the last drop of performance, we can also precompile the kernel. For example, with the Intel OpenCL tool chain we can run

                    ioc64 -cmd=build -input=my_kernel.cl -ir=my_kernel.clbin
                          -bo="-cl-std=CL2.0" -device=gpu
                  
which generates the precompiled file my_kernel.clbin. Now, we can create the program object faster using
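something like this sketch (assuming the PRECOMPILED program type):

    tbb::flow::opencl_program<> program{tbb::flow::opencl_program_type::PRECOMPILED, "my_kernel.clbin"};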
When passing this type of file to the opencl_program constructor, the factory internally uses the clCreateProgramWithBinary instead. An additional possibility is to provide the SPIR intermediate representation of the kernel, using opencl_program_type::SPIR. To generate the SPIR version, we can use

                    ioc64 -cmd=build -input=my_kernel.cl -spir64=my_kernel.spir
                          -bo="-cl-std=CL1.2"
                  
In both cases, the ioc64 compiler provides some useful information. The output of the last run will look like

                    Using build options: -cl-std=CL1.2
                    OpenCL Intel(R) Graphics device was found!
                    Device name: Intel(R) HD Graphics
                    Device version: OpenCL 2.0
                    Device vendor: Intel(R) Corporation
                    Device profile: FULL_PROFILE
                    fcl build 1 succeeded.
                    bcl build succeeded.
                    my_kernel info:
                           Maximum work-group size: 256
                           Compiler work-group size: (0, 0, 0)
                           Local memory size: 0
                           Preferred multiple of work-group size: 32
                           Minimum amount of private memory: 0
                    Build succeeded!
                  
This output informs us about, among other things, the maximum work-group size, 256, and the preferred multiple of the work-group size, 32, for this particular kernel.

Even More on Device Selection

In a previous section, we realized that the laptop we are using to conduct our experiments includes two GPUs. Let’s see a quick example in which we use both of them in the same flow graph. In Figure 19-22, we link two opencl_nodes so that the first computes C = A + B and sends C to the following one, which computes C = C - B. When both nodes are done, we check that C == A in a regular function_node. Array dimensions are rows × cols.
On our laptop, we already know that the device list f.devices() includes three devices, and that the second and third ones are the two GPUs. That way, we can safely use f.devices().begin() +1 and +2 to get the iterators pointing to each GPU, as we see in the boxed statements of Figure 19-22 for the two opencl_node definitions. In addition to targeting different GPUs, each opencl_node is configured to run a different kernel of the program fig_19_23.cl: cl_add and cl_sub. The information flowing from gpu_node1 to gpu_node2 is the opencl_buffer Cdevice. Inside the OpenCL factory, data movement is minimized and if, for example, an opencl_buffer has to be accessed by two consecutive opencl_nodes mapped onto the same GPU, the data allocated on the GPU is not moved to the CPU until the first CPU node of the graph tries to access the corresponding buffer (by using the opencl_buffer.begin() or opencl_buffer.data() member functions).
In Figure 19-23, we present the program fig_19_23.cl including the two kernels referenced in the previous code. Note that instead of passing the row width as a fourth argument, we use gSz.x that contains the same value.
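A sketch of the addition kernel (our reconstruction; only the work item with global ID (0,0) prints the global sizes, matching the output below):

    __kernel void cl_add(__global float *A, __global float *B, __global float *C) {
      int2 gId = (int2)(get_global_id(0), get_global_id(1));
      int2 gSz = (int2)(get_global_size(0), get_global_size(1));
      C[gId.y*gSz.x + gId.x] = A[gId.y*gSz.x + gId.x] + B[gId.y*gSz.x + gId.x];
      if (gId.x == 0 && gId.y == 0)
        printf("gSz.x=%d, gSz.y=%d\n", gSz.x, gSz.y);
    }
    // cl_sub is analogous: it subtracts B from the incoming C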
The output resulting from running the code of Figure 19-22 on our laptop is the following:

                  Running gpu_node1 on Intel(R) HD Graphics 530
                  Running gpu_node2 on AMD Radeon Pro 450 Compute Engine
                  gSz.x=4, gSz.y=4
                  gSz.x=4, gSz.y=4
                
It is also possible to have a single opencl_node change the OpenCL device to which the work is offloaded for every invocation of the node. The example of Figure 19-24 shows an opencl_node that is invoked three times, and for each invocation a different device is used to run a simple kernel.
The code uses the atomic variable device_num, initialized to 0. Each invocation of the gpu_node’s device selector returns a different device, cyclically traversing all of them (three on our platform). Along with the following kernel:
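sketched here (the kernel name and the cl_int element type are assumptions; the net effect is that every element of Adevice is incremented once per invocation):

    __kernel void cl_inc(__global int *A) {
      int i = get_global_id(0);
      A[i]++;
    }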
the resulting output is

                  Iteration: 0
                  Iteration: 1
                  Iteration: 2
                  Running on Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
                  Running on Intel(R) HD Graphics 530
                  Running on AMD Radeon Pro 450 Compute Engine
                  A[0]=1
                  A[0]=2
                  A[0]=3
                
where we can corroborate that the elements of the array Adevice have been incremented three times in three consecutive invocations of the gpu_node and the corresponding kernel has been executed on three different OpenCL devices.

A Warning Regarding the Order Is in Order!

One final caveat that we should be aware of has to do with the order in which messages arrive at an opencl_node when it is fed from several nodes. For example, in Figure 19-25 we illustrate a flow graph, g, that includes a gpu_node fed from two function nodes, filler0 and filler1. Each “filler” sends 1000 buffers, b, of 10 integers each, of the form {i,i,i,…,i}, with i ranging from 1 to 1000. The gpu_node receives both messages as b1 and b2 and just invokes an OpenCL kernel as simple as this:
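(a sketch consistent with the multiplication described next; the kernel name is ours)

    __kernel void cl_mul(__global int *b1, __global int *b2) {
      int i = get_global_id(0);
      b1[i] = b1[i] * b2[i];
    }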
As we see, it basically multiplies b1[i]=b1[i]*b2[i]. If b1 and b2 are equal (to {1,1,1,…}, or {2,2,2,…}, etc.), we should get at the output 1000 buffers with squared values ({1,1,1,…}, then {4,4,4,…}, and so on). Right? Sure? We don’t want to lie, so just in case, let’s double-check it in the last node of the graph, checker, which validates our assumption.
The code that implements the previous graph is listed in Figure 19-26. We agree with George Bernard Shaw in that “The liar’s punishment is not in the least that he is not believed, but that he cannot believe anyone else.” As liar connoisseurs, we use in our code a try-catch construction especially devised to catch liars!
We first define buffer_i as an opencl_buffer of integers. The two “fillers” receive an integer, i, and fill a buffer_i with 10 i’s that are sent to the gpu_node. The three lines used to configure the opencl_node are too basic for us now and do not require further elaboration. The last node is the checker, which throws an exception if any of the values received in the buffer processed on the GPU is not the square of an integer. After making the edges, a 1000-iteration loop puts the two fillers to work. Now, the moment of truth has arrived, and the output is

                  Liar!!: 42 is not a square of any integer number
                
Well, we got caught! Apparently, 6*7 was computed on the GPU, instead of 6*6 or 7*7. Why? The answer is that we have not taken enough measures to ensure that messages arriving at the gpu_node are paired correctly. Remember that the body of the “fillers” is executed by tasks, and we cannot assume any particular order regarding the task execution.
Fortunately, the opencl_node comes with a handy type-specified key matching feature that will save the day. We put that feature to work in Figure 19-27.
Basically, now buffer_i is a new class that inherits from opencl_buffer<cl_int> and adds an int my_key member variable and a key() member function that returns that key. Now the fillers have to use a different constructor (buffer_i b{N,i}), but more importantly, the opencl_node receives a second template argument (key_matching<int>). This automatically instructs the opencl_node to call the key() function and wait for messages with the same key value to be passed to all input ports. Done! If we run our code with these minor modifications, we will see that now we’ve been acquitted of perjury!

Summary

In this chapter, we presented the opencl_node feature of the TBB flow graph. We began with a simple Hello OpenCL_Node example that represented a first look at the opencl_node, covering just the basics of this class. We then started to dive deeper into some of the helper classes, such as the opencl_device_list, a container of opencl_device objects, and the Device Filter and Device Selector entities. In order to illustrate other helper classes and to give a more complex example, we also implemented the triad vector operation using an opencl_node to take care of part of the computation, whereas the rest is processed simultaneously on the CPU cores. While there, we covered in more detail the opencl_buffer helper class and the set_range and set_args member functions of the opencl_node class. The NDRange concept and how to set the global and local OpenCL sizes required almost a full section, where we also explained how to use the opencl_subbuffer class and the different ways to provide the kernel program (source, precompiled, or the SPIR intermediate representation). We followed up by introducing two examples that illustrate how to map different opencl_nodes of the flow graph onto different devices, or even how to change the device to which the opencl_node offloads the computation at each invocation. Finally, we described how to avoid ordering issues when an opencl_node is fed from different nodes.
One final disclaimer. Maybe we were actually lying in the end. As of writing this chapter, the opencl_node is still a preview feature, so it is subject to eventual changes. After 3 years of development, we don’t expect major changes, but we cannot promise this. If such changes end up in a future release, we do promise to write an updated edition of this chapter! Do you believe us?

For More Information

Here are some additional reading materials we recommend related to this chapter:
Hiking icon in Figure 19-1 made by Scott de Jonge from www.flaticon.com .
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits any noncommercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if you modified the licensed material. You do not have permission under this license to share adapted material derived from this chapter or parts of it.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Footnotes
2
Remember again that this is a preview feature. If you need more flexibility in this regard, we will appreciate it if you could file a request to let Intel know that you find the OpenCL node useful but that there are limitations that need to be addressed.
 