Open Access | Book | 2019

Pro TBB

C++ Parallel Programming with Threading Building Blocks

Authors: Michael Voss, Rafael Asenjo, James Reinders

Publisher: Apress

About this book

This open access book is a modern guide for all C++ programmers to learn Threading Building Blocks (TBB). Written by TBB and parallel programming experts, this book reflects their collective decades of experience in developing and teaching parallel programming with TBB, offering their insights in an approachable manner. Throughout the book the authors present numerous examples and best practices to help you become an effective TBB programmer and leverage the power of parallel systems.

Pro TBB starts with the basics, explaining parallel algorithms and C++'s built-in standard template library for parallelism. You'll learn the key concepts of managing memory, working with data structures and how to handle typical issues with synchronization. Later chapters apply these ideas to complex systems to explain performance tradeoffs, mapping common parallel patterns, controlling threads and overhead, and extending TBB to program heterogeneous systems or system-on-chips.

What You'll Learn

Use Threading Building Blocks to produce code that is portable, simple, scalable, and more understandable

Review best practices for parallelizing computationally intensive tasks in your applications

Integrate TBB with other threading packages

Create scalable, high performance data-parallel programs

Work with generic programming to write efficient algorithms

Who This Book Is For

C++ programmers learning to run applications on multicore systems, as well as C or C++ programmers without much experience with templates. No previous experience with parallel programming or multicore processors is required.

Table of Contents

Frontmatter

Part 1

Frontmatter

Open Access

Chapter 1. Jumping Right In: “Hello, TBB!”
Abstract
Over 10 years after its first release, the Threading Building Blocks (TBB) library has become one of the most widely used C++ libraries for parallel programming. While it has retained its core philosophy and features, it continues to expand to address new opportunities and challenges as they arise.
Michael Voss, Rafael Asenjo, James Reinders
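
For a taste of how small a first TBB program can be, here is a minimal sketch of our own (not reproduced from the chapter) using tbb::parallel_invoke:

    #include <tbb/parallel_invoke.h>
    #include <iostream>

    int main() {
        // Runs the two lambdas as independent TBB tasks, so the
        // print order is not guaranteed.
        tbb::parallel_invoke(
            [] { std::cout << "Hello" << std::endl; },
            [] { std::cout << "TBB!" << std::endl; });
        return 0;
    }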

Open Access

Chapter 2. Generic Parallel Algorithms
Abstract
What is the best method for scheduling parallel loops? How do we process data structures in parallel that do not support random-access iterators? What’s the best way to add parallelism to applications that look like pipelines? If the TBB library only provided tasks and a task scheduler, we would need to answer these questions ourselves. Luckily, we don’t need to plow through the many master’s theses and doctoral dissertations written on these topics. The TBB library developers have already done this dirty work for us! They provide the best-known methods for addressing these scenarios as template functions and template classes, a group of features known as the TBB generic parallel algorithms. These algorithms capture many of the processing patterns that are the cornerstones of multithreaded programming.
Michael Voss, Rafael Asenjo, James Reinders
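
One of these generic algorithms is tbb::parallel_reduce; a minimal sketch of our own (sizes and values are illustrative) sums a vector in parallel:

    #include <tbb/parallel_reduce.h>
    #include <tbb/blocked_range.h>
    #include <functional>
    #include <iostream>
    #include <vector>

    int main() {
        std::vector<double> v(1000000, 0.5);
        // parallel_reduce splits the range, sums each chunk in a
        // task, and combines the partial sums with std::plus.
        double sum = tbb::parallel_reduce(
            tbb::blocked_range<size_t>(0, v.size()), 0.0,
            [&](const tbb::blocked_range<size_t>& r, double acc) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    acc += v[i];
                return acc;
            },
            std::plus<double>());
        std::cout << sum << std::endl;
        return 0;
    }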

Open Access

Chapter 3. Flow Graphs
Abstract
In Chapter 2, we introduced a set of algorithms that match patterns we often come across in applications. Those are great! And we should use those whenever we can. Unfortunately, not all applications fit nicely into one of those boxes; they can be messy. When things start to get messy, we can become control freaks and try to micromanage everything or just decide to “go with the flow” and react to things as they come along. TBB lets us choose either path.
Michael Voss, Rafael Asenjo, James Reinders
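
"Going with the flow" looks roughly like this sketch of ours (not from the chapter): build nodes, wire them with edges, and push messages through:

    #include <tbb/flow_graph.h>
    #include <iostream>

    int main() {
        tbb::flow::graph g;
        // First node squares its input; any number of squarings
        // may run concurrently (unlimited).
        tbb::flow::function_node<int, int> square(
            g, tbb::flow::unlimited, [](int x) { return x * x; });
        // Second node prints; serial concurrency keeps output intact.
        tbb::flow::function_node<int, int> print(
            g, tbb::flow::serial,
            [](int x) { std::cout << x << std::endl; return x; });
        tbb::flow::make_edge(square, print);
        for (int i = 1; i <= 5; ++i) square.try_put(i);
        g.wait_for_all();   // block until all messages are processed
        return 0;
    }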

Open Access

Chapter 4. TBB and the Parallel Algorithms of the C++ Standard Template Library
Abstract
To use the Threading Building Blocks (TBB) library effectively, it is important to understand how it supports and augments the C++ standard. We discuss three aspects of TBB’s relationship with standard C++ in this chapter:
Michael Voss, Rafael Asenjo, James Reinders
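
As a flavor of that relationship, here is a sketch of ours using a C++17 standard parallel algorithm; in several standard library implementations the par execution policy is itself implemented on top of TBB:

    #include <algorithm>
    #include <execution>
    #include <vector>

    int main() {
        std::vector<int> v(1000000, 1);
        // Standard algorithm with a parallel execution policy; the
        // underlying threading may be provided by TBB.
        std::for_each(std::execution::par, v.begin(), v.end(),
                      [](int& x) { x *= 2; });
        return 0;
    }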

Open Access

Chapter 5. Synchronization: Why and How to Avoid It
Abstract
Let us underscore this up front: if you don’t need to use the synchronization features described in this chapter, so much the better. Here, we cover synchronization mechanisms and alternatives for achieving mutual exclusion. “Synchronization” and “exclusion” should have quite a negative connotation for parallel programmers who care about performance. These are operations that we want to avoid because they cost time and, in some cases, processor resources and energy. If we can rethink our data structures and algorithms so that they require neither synchronization nor mutual exclusion, great! Unfortunately, in many cases, it is impossible to avoid synchronization operations, and if that is your case today, keep reading! An additional take-home message from this chapter is that careful rethinking of our algorithm can usually result in a cleaner implementation that does not abuse synchronization. We illustrate this process of rethinking an algorithm by parallelizing a simple code, first following a naïve approach that resorts to mutexes, then evolving it to exploit atomic operations, and finally further reducing the synchronization between threads through privatization and reduction techniques. To this final end, we show how to leverage thread local storage (TLS) as a way to avoid the overhead of highly contended mutual exclusion. In what follows, we assume you are, to some extent, familiar with the concepts of “lock,” “shared mutable state,” “mutual exclusion,” “thread safety,” “data race,” and other synchronization-related issues. If not, a gentle introduction to them is provided in the Preface of this book.
Michael Voss, Rafael Asenjo, James Reinders
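
The final step the abstract describes, privatization plus reduction via TLS, looks roughly like this sketch of ours (counts and sizes are illustrative):

    #include <tbb/parallel_for.h>
    #include <tbb/enumerable_thread_specific.h>
    #include <iostream>

    int main() {
        const int n = 1000000;
        // Privatization: every thread increments its own copy, so
        // the hot loop needs no mutex and no atomic.
        tbb::enumerable_thread_specific<long> partial(0);
        tbb::parallel_for(0, n, [&](int) { ++partial.local(); });
        // Reduction: combine the per-thread copies once, at the end.
        long total = partial.combine(
            [](long a, long b) { return a + b; });
        std::cout << total << std::endl;
        return 0;
    }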

Open Access

Chapter 6. Data Structures for Concurrency
Abstract
In the previous chapter, we shared how much we dislike locks. We dislike them because they tend to make our parallel programs less effective by limiting scaling. Of course, they can be a “necessary evil” when needed for correctness; however, we are well advised to structure our algorithms to minimize the need for locks. This chapter gives us some tools to help. Chapters 1–4 focused on scalable algorithms; a common characteristic is that they avoided or minimized locking. Chapter 5 introduced explicit synchronization methods, including locks, for when we need them. In the next two chapters, we offer ways to avoid using explicit synchronization by relying on features of TBB. This chapter discusses concurrent containers, data structures designed to help address critical concurrency considerations while avoiding locks. A related topic, the use of thread local storage (TLS), was already covered in Chapter 5.
Michael Voss, Rafael Asenjo, James Reinders
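
As a small sketch of ours (not from the chapter), tbb::concurrent_hash_map lets many threads update a shared table with only per-element locking:

    #include <tbb/concurrent_hash_map.h>
    #include <tbb/parallel_for.h>

    int main() {
        typedef tbb::concurrent_hash_map<int, int> table_t;
        table_t counts;
        tbb::parallel_for(0, 100000, [&](int i) {
            // The accessor write-locks only the one element it
            // refers to; no lock ever covers the whole table.
            table_t::accessor a;
            counts.insert(a, i % 100);  // finds or creates the key
            a->second += 1;
        });
        return 0;
    }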

Open Access

Chapter 7. Scalable Memory Allocation
Abstract
This chapter discusses a critical part of any parallel program: scalable memory allocation, which includes use of new as well as explicit calls to malloc, calloc, and so on. Scalable memory allocation can be used regardless of whether we use any other part of Intel Threading Building Blocks (TBB). In addition to interfaces to use directly, TBB offers a “proxy” method to automatically replace C/C++ functions for dynamic memory allocation, which is an easy, effective, and popular way to get a performance boost without any code changes. This is important and works regardless of how “modern” you are in your usage of C++, specifically whether you use the modern and encouraged std::make_shared or the now discouraged new and malloc. The performance benefits of using a scalable memory allocator are significant because it directly addresses issues that would otherwise limit scaling and risk false sharing. TBB’s scalable memory allocator was among the first widely used scalable memory allocators, in no small part because it came free with TBB, which helped highlight the importance of including memory allocation considerations in any parallel program. It remains extremely popular today and is one of the best scalable memory allocators available.
Michael Voss, Rafael Asenjo, James Reinders
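
The direct interfaces look roughly like this sketch of ours (sizes are illustrative):

    #include <tbb/scalable_allocator.h>
    #include <vector>

    int main() {
        // Route one container's allocations through the TBB
        // scalable allocator via the standard allocator interface...
        std::vector<double, tbb::scalable_allocator<double> > v(1000000);

        // ...or call the C-level interface directly.
        void* p = scalable_malloc(1024);
        scalable_free(p);
        return 0;
    }

The zero-code-change route the abstract mentions is the proxy: linking the tbbmalloc_proxy library replaces malloc/free and new/delete globally.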

Open Access

Chapter 8. Mapping Parallel Patterns to TBB
Abstract
It has been said that history does not repeat, it rhymes.
Michael Voss, Rafael Asenjo, James Reinders

Part 2

Frontmatter

Open Access

Chapter 9. The Pillars of Composability
Abstract
In this chapter, we discuss composability: what it is, what characteristics make Threading Building Blocks (TBB) a composable threading library, and how to use a library like TBB to create scalable applications. C++ is a composable language, and TBB adds parallelism in a way that maintains composability. Composability with TBB is highly valuable because it means we are free to expose opportunities for parallelism without worrying about overloading the system. If we do not expose parallelism, we limit scaling.
Michael Voss, Rafael Asenjo, James Reinders
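
A tiny sketch of ours illustrates the point: nesting TBB algorithms is safe precisely because TBB is composable.

    #include <tbb/parallel_for.h>

    int main() {
        // Nested TBB parallelism composes safely: both loop levels
        // draw tasks from one shared thread pool, so exposing
        // parallelism at both levels does not oversubscribe the
        // machine with extra software threads.
        tbb::parallel_for(0, 100, [](int) {
            tbb::parallel_for(0, 100, [](int) {
                volatile int x = 0;  // stand-in for real work
                for (int k = 0; k < 1000; ++k) x = x + k;
            });
        });
        return 0;
    }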

Open Access

Chapter 10. Using Tasks to Create Your Own Algorithms
Abstract
One of the things that we like the most about TBB is its “multiresolution” nature. In the context of parallel programming models, multiresolution means that we can choose among different levels of abstraction to code our algorithm. In TBB, we have high-level templates, such as parallel_for or pipeline (see Chapter 2), that are ready to use when our algorithms fit into these particular patterns. But what if our algorithm is not that simple? Or what if the available high-level abstraction is not squeezing out the last drop of performance of our parallel hardware? Should we just give up and remain prisoners of the high-level features of the programming model? Of course not! There should be a capability to get closer to the hardware, a way to build our own templates from the ground up, and a way to thoroughly optimize our implementation using low-level and more tunable characteristics of the programming model. And in TBB, this capability exists. In this chapter, we will focus on one of the most powerful low-level features of TBB, the task programming interface. As we have said throughout the book, tasks are at the heart of TBB, and tasks are the building blocks used to construct the high-level templates such as parallel_for and pipeline. But there is nothing that prevents us from venturing into these deeper waters and starting to code our algorithms directly with tasks, building our own high-level templates for future use on top of tasks, or, as we describe in the next chapters, fully optimizing our implementation by fine-tuning the way in which tasks are executed. In essence, this is what you will learn by reading this chapter and the ones that follow. Enjoy the deep dive!
Michael Voss, Rafael Asenjo, James Reinders
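
One way to code directly with tasks is tbb::task_group; here is a divide-and-conquer sketch of ours (the chapter itself also covers lower-level tasking interfaces):

    #include <tbb/task_group.h>
    #include <iostream>

    // A recursive algorithm written directly with tasks.
    long fib(int n) {
        if (n < 2) return n;
        long a = 0, b = 0;
        tbb::task_group tg;
        tg.run([&] { a = fib(n - 1); });  // spawn one half as a task
        b = fib(n - 2);                   // compute the other half here
        tg.wait();                        // join before reading a
        return a + b;
    }

    int main() {
        std::cout << fib(25) << std::endl;
        return 0;
    }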

Open Access

Chapter 11. Controlling the Number of Threads Used for Execution
Abstract
By default, the TBB library initializes its scheduler with what is typically the right number of threads to use. It creates one fewer worker thread than the number of logical cores on the platform, leaving one of the cores available to execute the main application thread. Because the TBB library implements parallelism using tasks that are scheduled onto these threads, this is usually the right number of threads to have: there is exactly one software thread for each logical core, and the scheduling algorithms in TBB efficiently distribute tasks to these software threads using work stealing as described in Chapter 9.
Michael Voss, Rafael Asenjo, James Reinders
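
One mechanism for changing this default is tbb::global_control, as in this sketch of ours (the thread count of 4 is illustrative):

    #include <tbb/global_control.h>
    #include <tbb/parallel_for.h>

    int main() {
        // While this object is alive, TBB may use at most 4 threads.
        tbb::global_control limit(
            tbb::global_control::max_allowed_parallelism, 4);
        tbb::parallel_for(0, 1000, [](int) { /* work */ });
        return 0;
    }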

Open Access

Chapter 12. Using Work Isolation for Correctness and Performance
Abstract
Anyone who has been around children (or who acts like a child) knows that sometimes the only way to stop children from bothering each other is to separate them. The same thing can be said of TBB tasks and algorithms. When tasks or algorithms just can’t get along, we can separate them using work isolation.
Michael Voss, Rafael Asenjo, James Reinders
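
In code, the separation looks roughly like this sketch of ours using tbb::this_task_arena::isolate:

    #include <tbb/parallel_for.h>
    #include <tbb/task_arena.h>

    int main() {
        tbb::parallel_for(0, 10, [](int) {
            // While waiting for the inner loop, this thread could
            // otherwise steal unrelated outer tasks; isolate()
            // restricts it to tasks spawned within this scope.
            tbb::this_task_arena::isolate([] {
                tbb::parallel_for(0, 10, [](int) { /* nested work */ });
            });
        });
        return 0;
    }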

Open Access

Chapter 13. Creating Thread-to-Core and Task-to-Thread Affinity
Abstract
When developing parallel applications with the Threading Building Blocks library, we create tasks by using the high-level execution interfaces or the low-level APIs. These tasks are scheduled by the TBB library onto software threads using work stealing. These software threads are scheduled by the Operating System (OS) onto the platform’s cores (hardware threads). In this chapter, we discuss the features in TBB that let us influence the scheduling choices made by the OS and by TBB. Thread-to-core affinity is used when we want to influence the OS so that it schedules the software threads onto particular core(s). Task-to-thread affinity is used when we want to influence the TBB scheduler so that it schedules tasks onto particular software threads. Depending on what we are trying to achieve, we may be interested in one kind of affinity or the other, or a combination of both.
Michael Voss, Rafael Asenjo, James Reinders
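
One task-to-thread affinity feature is tbb::affinity_partitioner; a sketch of ours (array size and pass count are illustrative):

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <vector>

    int main() {
        std::vector<float> a(1000000, 1.0f);
        // Reusing one affinity_partitioner across passes hints the
        // TBB scheduler to replay the same range-to-thread mapping
        // each time, so data tends to stay in the same caches.
        tbb::affinity_partitioner ap;
        for (int pass = 0; pass < 10; ++pass) {
            tbb::parallel_for(
                tbb::blocked_range<size_t>(0, a.size()),
                [&](const tbb::blocked_range<size_t>& r) {
                    for (size_t i = r.begin(); i != r.end(); ++i)
                        a[i] *= 2.0f;
                },
                ap);
        }
        return 0;
    }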

Open Access

Chapter 14. Using Task Priorities
Abstract
The Threading Building Blocks scheduler is not a real-time scheduler, and therefore it is not suitable for use in hard real-time systems. In real-time systems, a task can be given a deadline by which it must complete, and the usefulness of the task degrades if it misses its deadline. In hard real-time systems, a missed deadline can lead to a total system failure. In soft real-time systems, a missed deadline is not catastrophic but leads to a decrease in quality of service. The TBB library has no support for assigning deadlines to tasks, but it does have support for task priorities. These priorities might be of use in applications that have soft real-time requirements. Whether or not they are sufficient requires understanding both the soft real-time demands of the application and the properties of TBB tasks and task priorities.
Michael Voss, Rafael Asenjo, James Reinders
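
In the TBB 2019-era API this book covers, a priority is attached through a task_group_context, as in this sketch of ours (later oneTBB releases changed how priorities are expressed):

    #include <tbb/parallel_for.h>
    #include <tbb/task.h>

    int main() {
        // All tasks of the algorithm launched with this context run
        // at elevated priority relative to other TBB work.
        tbb::task_group_context ctx;
        ctx.set_priority(tbb::priority_high);
        tbb::parallel_for(0, 1000, [](int) { /* urgent work */ }, ctx);
        return 0;
    }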

Open Access

Chapter 15. Cancellation and Exception Handling
Abstract
More or less frequently, we all get bitten by run-time errors, whether in our sequential or parallel developments. To try to assuage the pain, we have learnt to capture them using error codes or a higher-level alternative like exception handling. C++, like most OO languages, supports exception handling, which, when conveniently exercised, enables the development of robust applications. Now, considering that TBB adds task-based parallelism on top of C++, it is perfectly understandable that developers expect exception handling to be well supported. As we will see in this chapter, exception handling is indeed well and automatically supported in TBB. This means that in case of an error, perish the thought, our code can resort to an exception handler if one is available, or terminate all of the work otherwise. Implementing support in TBB was certainly nontrivial considering that
Michael Voss, Rafael Asenjo, James Reinders
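
The basic behavior is easy to sketch (our example, not the book's): an exception thrown inside a parallel algorithm cancels its remaining tasks and propagates to the caller.

    #include <tbb/parallel_for.h>
    #include <iostream>
    #include <stdexcept>

    int main() {
        try {
            tbb::parallel_for(0, 1000, [](int i) {
                if (i == 42) throw std::out_of_range("bad index");
            });
        } catch (const std::out_of_range& e) {
            // The exception cancels the algorithm's remaining tasks
            // and is re-thrown on the thread that called parallel_for.
            std::cout << "caught: " << e.what() << std::endl;
        }
        return 0;
    }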

Open Access

Chapter 16. Tuning TBB Algorithms: Granularity, Locality, Parallelism, and Determinism
Abstract
In Chapter 2, we described the generic parallel algorithms provided by the TBB library and gave a few examples to show how they can be used. While doing so, we noted that the default behavior of the algorithms was often good enough but claimed that there were ways to tune performance if needed. In this chapter, we back up that claim by revisiting some TBB algorithms and discussing important features that can be used to change their default behaviors.
Michael Voss, Rafael Asenjo, James Reinders
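
One such feature is the combination of an explicit grainsize with a partitioner; a sketch of ours (the grainsize of 1024 is illustrative):

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>

    int main() {
        // An explicit grainsize with simple_partitioner forces
        // deterministic chunking: each task body sees at most 1024
        // iterations, instead of auto_partitioner's adaptive chunks.
        tbb::parallel_for(
            tbb::blocked_range<int>(0, 1000000, 1024),
            [](const tbb::blocked_range<int>& r) {
                for (int i = r.begin(); i != r.end(); ++i) { /* work */ }
            },
            tbb::simple_partitioner());
        return 0;
    }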

Open Access

Chapter 17. Flow Graphs: Beyond the Basics
Abstract
This chapter contains some key tips on getting top performance from flow graphs in TBB. The less structured nature of the TBB flow graph APIs offers an expressiveness that requires some thinking to get the best scalable performance – we dive into details in this chapter that let us tune flow graphs to their full potential.
Michael Voss, Rafael Asenjo, James Reinders

Open Access

Chapter 18. Beef Up Flow Graphs with Async Nodes
Abstract
Back in 2005, Herb Sutter wrote the article “The Free Lunch Is Over” to warn us about the dawn of the multicore era and its implications on software development. In the multicore era, developers who care about performance can no longer sit back and lazily wait for the next processor generation in order to gleefully see their apps running faster. Those days are long gone. Herb’s message was that developers who care about fully exploiting modern processors would have to embrace parallelism. At this point in the book, we certainly know this, so what? Well, we believe that today “Lunch is getting much too expensive.” Let’s elaborate on this.
Michael Voss, Rafael Asenjo, James Reinders
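
The node in question, tbb::flow::async_node, hands work to something outside the TBB thread pool and feeds the result back through a gateway; a sketch of ours, with a plain std::thread standing in for an accelerator or asynchronous runtime:

    #include <tbb/flow_graph.h>
    #include <iostream>
    #include <thread>

    int main() {
        using namespace tbb::flow;
        graph g;
        async_node<int, int> offload(g, unlimited,
            [](int x, async_node<int, int>::gateway_type& gw) {
                gw.reserve_wait();         // keep the graph alive
                std::thread([x, &gw] {
                    gw.try_put(x * x);     // feed the result back
                    gw.release_wait();
                }).detach();
            });
        function_node<int, int> sink(g, serial,
            [](int y) { std::cout << y << std::endl; return y; });
        make_edge(offload, sink);
        offload.try_put(7);
        g.wait_for_all();
        return 0;
    }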

Open Access

Chapter 19. Flow Graphs on Steroids: OpenCL Nodes
Abstract
Does the async_node leave you yearning for more? If so, this is your chapter. Here, we will cover a high-level Flow Graph class, the opencl_node, that strives to hide the hardware details and programming nuances of OpenCL-capable devices. Why OpenCL? Well, there are many reasons; to name a few: OpenCL is an open standard contributed to by members of a large consortium, it is designed to be a platform-independent API, and it aims to nimbly evolve to meet newer requirements. For instance, OpenCL has been an extension of C (not C++), but the latest OpenCL 2.2 version adds support for a subset of C++14, including classes, lambda expressions, templates, and so on.
Michael Voss, Rafael Asenjo, James Reinders

Open Access

Chapter 20. TBB on NUMA Architectures
Abstract
Advanced programmers who care about performance know that exploiting locality is paramount. When it comes to locality, cache locality is the one that immediately springs to mind, but in many cases, for heavy-duty applications running on large shared-memory architectures, Non-Uniform Memory Access (NUMA) locality should also be considered. As you certainly know, NUMA conveys the message that memory is organized in different banks and some cores have faster access to some of the “close” banks than to “far” banks. More formally, a NUMA node is a grouping of the cores, caches, and local memory in which all cores share the same access time to the local shared caches and memory. Access time from one NUMA node to a different one can be significantly larger. Some questions arise, such as how the program data structures are allocated on the different NUMA nodes and where the threads that process these data structures are running (are they close or far from the data?). In this chapter, we address these questions and, more importantly, discuss what can be done to exploit NUMA locality within a TBB parallel application.
Michael Voss, Rafael Asenjo, James Reinders
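
A common structural approach uses one tbb::task_arena per NUMA node; this hypothetical sketch of ours shows the shape only, since the actual pinning of an arena's threads to a node requires OS or hwloc calls that are not TBB APIs and are not shown:

    #include <tbb/task_arena.h>
    #include <tbb/parallel_for.h>
    #include <vector>

    int main() {
        const int num_nodes = 2;  // assumed: a machine with two NUMA nodes
        std::vector<tbb::task_arena> arenas(num_nodes);
        for (int node = 0; node < num_nodes; ++node) {
            // Arenas run one after another here for brevity; real
            // code would fill them concurrently.
            arenas[node].execute([node] {
                (void)node;  // a real version would pin this arena's
                             // threads to 'node' via OS/hwloc calls
                tbb::parallel_for(0, 1000, [](int) {
                    /* work on data allocated on this NUMA node */
                });
            });
        }
        return 0;
    }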
Backmatter
Metadata
Title
Pro TBB
Authors
Michael Voss
Rafael Asenjo
James Reinders
Copyright Year
2019
Publisher
Apress
Electronic ISBN
978-1-4842-4398-5
Print ISBN
978-1-4842-4397-8
DOI
https://doi.org/10.1007/978-1-4842-4398-5
