1 Introduction
2 Building Blocks for a Scalable HPC Operating System
2.1 The Case for a Multi-Kernel Operating System
2.2 L4Re Microkernel and L4Linux
Capabilities may refer to a Thread or a Task, representing an independent flow of execution or an address space, respectively. But they may also point to an Ipc_gate, which is a communication endpoint through which any user-space program can offer an arbitrary service to whomever possesses the corresponding capability. Hardware interrupts are delivered to user-level software through an Irq kernel object. In conjunction with the possibility to map I/O memory regions of hardware devices directly into user address spaces, it is possible to implement device drivers outside the microkernel.
L4Linux creates Linux processes as Task objects provided by the L4Re microkernel. Thus, every Linux process and the contents of its address space are known to the microkernel. Furthermore, L4Linux multiplexes all user-level threads executing in such an address space onto its vCPUs. Consequently, the L4Re microkernel is involved in every context switch of any Linux user thread. In particular, it is responsible for forwarding any exceptions raised by a Linux program to the L4Linux kernel. Exceptions occur when a thread makes a system call, when a page fault occurs during its execution, or when a hardware device signals an interrupt. L4Linux receives these exceptions at a previously registered vCPU entry point, to which the microkernel switches the control flow when it migrates the vCPU from the Task of the faulting Linux user program to the address space of the virtualized Linux kernel.
2.3 Decoupled Execution on L4Linux
To decouple a thread from the virtualized Linux kernel, we let it run on a native L4Re host thread that executes in the same Task (i.e., address space) as the Linux process, but not under control of L4Linux. The original Linux thread context in the L4Linux kernel is suspended while execution progresses on the native L4Re host thread. The user code running there will raise exceptions just as if it were executed by a vCPU, except that the microkernel forwards each of them to L4Linux as an exception IPC message. A message of this type carries a thread's register state and fault information as its payload, and is delivered by the microkernel to an exception handler. We configure L4Linux to be the exception handler of the "decoupled" Linux user threads. An attempt to perform a Linux system call will also result in an exception, which the L4Linux kernel can then handle by briefly reactivating the previously suspended thread context and scheduling it onto a vCPU. Afterwards, execution is resumed in decoupled mode on the L4Re host thread. Figure 3 visualizes decoupled execution; more details can be found in our publications [33, 60].
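To illustrate the control flow described above, the following is a minimal sketch of an exception-handling loop for decoupled threads. All types and helpers in it (exc_state, receive_exception_ipc, handle_on_vcpu, reply_and_resume) are hypothetical placeholders rather than actual L4Re or L4Linux interfaces; the sketch only conveys how an exception IPC carrying register state could trigger a brief reactivation of the suspended Linux thread context.

```c
/* Hypothetical sketch of servicing exceptions of a decoupled thread.
 * All identifiers are illustrative placeholders, not real L4Re/L4Linux APIs. */

#include <stdbool.h>
#include <stdio.h>

struct exc_state {            /* payload of an exception IPC message     */
    unsigned long regs[32];   /* register state of the faulting thread   */
    unsigned long trapno;     /* fault/trap number (syscall, page fault) */
    unsigned long fault_addr; /* faulting address, if any                */
};

/* Placeholder: block until the microkernel forwards the next exception IPC.
 * Returns false here so the sketch terminates; a real handler would block. */
static bool receive_exception_ipc(struct exc_state *st)
{
    (void)st;
    return false;
}

/* Placeholder: reactivate the suspended Linux thread context and let the
 * L4Linux kernel handle the system call or page fault on a vCPU. */
static void handle_on_vcpu(struct exc_state *st)
{
    printf("handling trap %lu at %#lx on a vCPU\n", st->trapno, st->fault_addr);
}

/* Placeholder: reply to the exception IPC with the (possibly updated)
 * register state, resuming the decoupled L4Re host thread. */
static void reply_and_resume(const struct exc_state *st)
{
    (void)st;
}

/* System calls and page faults of a decoupled thread arrive as exception
 * IPC, take a brief detour through L4Linux, and execution then resumes
 * in decoupled mode. */
void decoupled_exception_loop(void)
{
    struct exc_state st;

    while (receive_exception_ipc(&st)) {
        handle_on_vcpu(&st);
        reply_and_resume(&st);
    }
}
```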
Using the rdtsc instruction, we measured delays of up to 55 CPU cycles per iteration when FWQ is executed by a decoupled thread on a dedicated core. The execution-time jitter is reduced to 4 cycles per iteration when FWQ is offloaded to the second socket, while L4Linux is pinned to a single core of socket 1.
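For illustration, a fixed-work-quantum measurement of this kind can be sketched as follows, assuming an x86 compiler that provides the __rdtsc() intrinsic via x86intrin.h; the work loop, sample count, and reporting are simplified stand-ins for the actual FWQ benchmark.

```c
/* Minimal FWQ-style jitter measurement using the x86 time-stamp counter.
 * Assumes an x86 compiler providing __rdtsc() via <x86intrin.h>. */

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

#define SAMPLES 100000
#define WORK    1000          /* fixed work quantum per iteration */

static volatile uint64_t sink;  /* keeps the work loop from being optimized away */

int main(void)
{
    uint64_t min = UINT64_MAX, max = 0;

    for (int i = 0; i < SAMPLES; i++) {
        uint64_t start = __rdtsc();

        uint64_t acc = 0;
        for (int w = 0; w < WORK; w++)   /* the fixed quantum of work */
            acc += w;
        sink = acc;

        uint64_t cycles = __rdtsc() - start;
        if (cycles < min) min = cycles;
        if (cycles > max) max = cycles;
    }

    /* Jitter: spread between the fastest and the slowest iteration. */
    printf("min: %llu cycles, max: %llu cycles, jitter: %llu cycles\n",
           (unsigned long long)min, (unsigned long long)max,
           (unsigned long long)(max - min));
    return 0;
}
```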
2.4 Hardware Performance Variation
Platform/property | Intel Ivy Bridge | Intel KNL | Fujitsu FX100 | Cavium ThunderX | IBM BG/Q
---|---|---|---|---|---
ISA | x86 | x86 | SPARC | ARM | PowerISA
Number of cores | 8 | 64 + 4 | 32 + 2 | 48 | 16 + 2
Number of SMT threads | 2 | 4 | N/A | N/A | 4
Clock frequency | 2.6 GHz | 1.4 GHz | 2.2 GHz | 2.0 GHz | 1.6 GHz
L1d size | 32 kB | 32 kB | 64 kB | 32 kB | 16 kB
L1i size | 32 kB | 32 kB | 64 kB | 78 kB | 16 kB
L2 size | 256 kB | 1 MB × 34 | 24 MB | 16 MB | 32 MB
L3 size | 20 MB | N/A | N/A | N/A | N/A
On-chip network | Ring | 2D mesh | unknown | Ring | Crossbar
Process technology | 22 nm | 14 nm | 20 nm | 28 nm | 45 nm
- The DGEMM benchmark performs matrix multiplication. We confine ourselves to naïve matrix multiplication algorithms to allow compilers to emit SIMD instructions, if possible. This benchmark kernel is intended to measure hardware performance variation for double-precision floating-point and vector operations; the SHA algorithm exercises the integer execution units instead. (A sketch of such a naïve kernel is shown after this list.)
- Using John McCalpin's STREAM benchmark, we assess variability in the cache and memory subsystems. The Capacity benchmark is intended to measure the performance variation of cache misses themselves.
- HACCmk from the CORAL benchmark suite is a compute-intensive N-body simulation kernel with regular memory accesses. HPCCG from Mantevo's benchmark suite is a Mini-App aimed at exhibiting the performance properties of real-world physics codes working on unstructured grid problems. MiniFE is another proxy application for unstructured implicit finite-element codes from Mantevo's package.
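The following is a sketch of a naïve double-precision matrix-multiplication kernel of the kind referred to above; the matrix size and row-major layout are arbitrary choices for this illustration, not the parameters used in our measurements.

```c
/* Naïve DGEMM-style kernel: C = A * B for square N x N matrices in
 * row-major layout. The plain triple loop leaves vectorization to the
 * compiler. N is an arbitrary value chosen for this sketch. */

#include <stddef.h>

#define N 512

void dgemm_naive(const double A[N][N], const double B[N][N], double C[N][N])
{
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];  /* inner product of row i and column j */
            C[i][j] = sum;
        }
    }
}
```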
2.5 Scalable Diffusion-Based Load Balancing
- Load Imbalance (max. load among processes / average load − 1): Diffusion performs better than ParMetis, but worse than the three geometric methods. (A helper computing this metric is sketched after this list.)
- Migration (max. no. of tasks a process imports and exports / avg. no. of tasks per process): Diffusion outperforms the other methods, especially in the Shock scenario (a factor of 3–10 less migration).
- Edge-cut (max. edge-cut among all processes / avg. number of edges per process): All methods achieve very similar results, except diffusion with low imbalance tolerance at high process counts (coarse granularity).
- Run time of the method (max. among all processes): Due to its scalability, diffusion is clearly faster than the Zoltan methods at higher parallelism and 1–2 orders of magnitude faster than ParMetis. At 6144 processes, it requires only 2.5 ms.
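As a concrete reading of the load-imbalance metric defined above, the following helper computes it from an array of per-process loads; the function and its inputs are purely illustrative and not part of the balancing implementation.

```c
/* Load imbalance as defined above: (maximum load among processes /
 * average load) - 1. A value of 0 means a perfectly balanced partition. */

#include <stddef.h>

double load_imbalance(const double *load, size_t nprocs)
{
    double max = 0.0, sum = 0.0;

    for (size_t p = 0; p < nprocs; p++) {
        if (load[p] > max)
            max = load[p];
        sum += load[p];
    }

    double avg = sum / (double)nprocs;
    return max / avg - 1.0;  /* e.g., 0.1: the busiest process carries 10% more than average */
}
```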
2.6 Beyond L4: Improving Scalability with M3 and SemperOS
3 Algorithms and System Support for Fault Tolerance
3.1 Resilience in MPI Collective Operations
No. of processes | No faults | 1 fault | 2 parallel faults | 2 serial faults | 3 mixed faults
---|---|---|---|---|---
64 | 0.0005 | 2.0095 | 2.0094 | 4.0194 | 4.0194
128 | 0.0006 | 2.0123 | 2.0123 | 4.0221 | 4.0222
256 | 0.0007 | 2.0121 | 2.0123 | 4.0220 | 4.0219
384 | 0.0017 | 2.0114 | 2.0323 | 4.0404 | 4.0217
512 | 0.0024 | 2.0343 | 2.0110 | 4.0421 | 4.0422
3.2 Corrected Gossip Algorithms
3.3 Corrected Tree Algorithms
3.4 Checkpointing Scheduling
3.5 Multi-Level Checkpoint/Restart
- Local Level: The first checkpoint level is the node's local storage (ramdisk, SSD). While this level benefits from the lowest checkpointing cost among all levels, it cannot survive fatal faults where the node becomes unavailable.
- XOR Level: The next level is the XOR level, at which a parity segment is computed for each node and distributed among a set of nodes (an XOR group) according to [24, 47]. Each node stores its local checkpoint along with the XOR parity data. This level survives the loss of a single node per XOR group. (A sketch of the parity computation is given after this list.)
- Partner Level: A more stable level is the partner scheme, in which the checkpoint is stored both in the node's local storage and in the partner node's storage (two copies). This level fails to restart the job if a node and its partner fail together.
- Shared Levels: The most stable levels use storage shared among all compute nodes, namely the burst buffer and the parallel file system. However, they provide lower checkpointing bandwidth per node, because many (or all) nodes may access them concurrently.
- Asynchronous Checkpointing: To further reduce the checkpointing costs of the lower levels, checkpoint writes are performed asynchronously. The application writes its state to the node's local storage, providing fast write performance. Thereafter, checkpointing at the lower levels (e.g., partner, burst buffer, PFS) is performed by the daemon in the background while the application returns to its computation. While the daemon is still writing, any further checkpoint request from the job for the same level is ignored until the data has been persisted.
- Fast Recovery: To ensure fast recovery in the case of light failures (node still alive), ULFM [7] is used to automatically recover the failed ranks from the most recent checkpoint and to reorder the ranks to their original arrangement, transparently to the developer, the application, and the resource manager.
- Job Life Cycle and Recovery: When a new job is started, a short-lived controller service is executed by the first rank on each node (i.e., the node's local rank 0). This service forks a daemon process for each rank on the node, connects the daemons to their corresponding ranks, and then terminates. The per-rank daemons are responsible for sending status information and receiving instructions, and they stay alive until the application has finished its computation. If a rank fails, its daemon terminates and the rank is later restarted using ULFM. When a rank is restarted, or when it notices that its daemon has died, it executes the previously mentioned controller service again, which then reconnects a new daemon process with the rank. This happens transparently to the application, and the existence of both daemons and controller services remains invisible to the user and the developer.
- Rotating Partner Nodes: To further enhance the stability of the partner level, a novel approach will be employed in which nodes have no fixed partners; instead, a node's partner is chosen anew by the daemon for each checkpoint. This reduces the probability that checkpoints at this level become unavailable. In addition, partner checkpoints and data transfers will be performed by the service daemons of both nodes using Remote Direct Memory Access (RDMA), so that partner checkpoints complete quickly in the background.
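To make the XOR level concrete, the following sketch computes a parity segment over the equally sized checkpoint segments of an XOR group and reconstructs a single lost segment from the survivors and the parity. It is a simplified illustration, not the implementation from [24, 47].

```c
/* Sketch of the XOR level: a parity segment over the equally sized
 * checkpoint segments of an XOR group. If one checkpoint is lost, it can
 * be reconstructed by XOR-ing the surviving segments with the parity.
 * Simplified illustration; not the implementation from [24, 47]. */

#include <stddef.h>
#include <stdint.h>

/* Compute parity over `group_size` checkpoint segments of `len` bytes. */
void xor_parity(uint8_t *parity, const uint8_t *const chkpt[],
                size_t group_size, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t p = 0;
        for (size_t n = 0; n < group_size; n++)
            p ^= chkpt[n][i];
        parity[i] = p;
    }
}

/* Reconstruct the checkpoint segment of the single failed node `lost`
 * from the surviving segments and the parity. */
void xor_reconstruct(uint8_t *out, const uint8_t *const chkpt[],
                     const uint8_t *parity, size_t group_size,
                     size_t lost, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t p = parity[i];
        for (size_t n = 0; n < group_size; n++)
            if (n != lost)
                p ^= chkpt[n][i];
        out[i] = p;  /* XOR of parity and survivors yields the lost data */
    }
}
```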