1 Introduction
- We provide the first analyses of actual (not prototyped or simulated) PMem based on Intel's Optane DC Persistent Memory Modules (PMM). We highlight the impact of the physical properties of PMem on software and derive guidelines for the efficient usage of PMem (Sect. 2).
- We investigate different algorithms for persisting large data chunks (database pages) in a failure-atomic fashion to PMem. By combining a copy-on-write method with temporary delta files, we achieve significant speedups (Sect. 3.2).
- We introduce an algorithm for persisting small data chunks (transactional log entries) that reduces the latency by \(2\times \) compared to state-of-the-art algorithms (Sect. 3.3).
- We introduce a new abstraction on top of PMem, called Failure-Atomic Memory (FAM), that allows for fast in-place updates on PMem (Sect. 3.4).
- We show how synchronous persistent writes to PMem can be interleaved using fibers to avoid stalling on the relatively high PMem latencies (Sect. 3.5).
2 PMem characteristics
CPU | Intel Xeon Gold 6212U
---|---
Frequency | \({2.40}\hbox { GHz}\) (turbo: \({3.90}\hbox { GHz}\))
# Cores | 24
L1 I+D Cache (per core) | \({64}\hbox { kB}\)
L2 Cache (per core) | \({1}\hbox { MB}\)
L3 Cache | \({35.8}\hbox { MB}\)
# AVX-512 Units | 2
CPU Supported Memory | \({1}\hbox { TB}\) (DRAM + PMem)
DRAM | \({192}\hbox { GB}\,(6 \times {32}\hbox { GB})\)
PMem | \({768}\hbox { GB}\,(6 \times {128}\hbox { GB})\)
2.1 Setup and configuration
First, we use ipmctl to create a region spanning the PMem DIMMs. To avoid complicating the following experiments with a discussion of NUMA effects (which are similar to the ones on DRAM), we run all our experiments on socket 0. Once a region is created, ndctl is used to create a namespace on top of it. Next, we create a file system on top of this namespace (mkfs.ext4) and mount it (mount) using the dax flag, which enables direct cache-line-grained access to the device by the CPU. Programs can now create files on the newly mounted device and map them into their address space using the mmap system call.
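For illustration, a minimal sketch of this last step (the mount point, file name, and size are placeholders, and error handling is omitted):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
  // Create and size a file on the dax-mounted file system.
  int fd = open("/mnt/pmem0/data", O_CREAT | O_RDWR, 0644);
  ftruncate(fd, 1ull << 30);  // 1 GB
  // Map it into the address space; loads and stores now reach PMem
  // directly at cache-line granularity (no page cache in between).
  void* pmem = mmap(nullptr, 1ull << 30, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
  // ... access *pmem with regular loads/stores, clwb, and sfence ...
  munmap(pmem, 1ull << 30);
  close(fd);
  return 0;
}
```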
2.2 Latency
To measure the read latency, we use a single thread that performs dependent reads: each load's address depends on the previously loaded value. This prevents out-of-order execution without relying on fence instructions (lfence), as these still allow for speculative loads and introduce a small CPU overhead. To minimize caching effects, we use an array sufficiently larger (\({8}\hbox { GB}\)) than the CPU cache (\({35.8}\hbox { MB}\)). The results of this experiment are shown in Fig. 1.
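A minimal sketch of such a dependent-read loop (the initialization of the array with a random permutation cycle is omitted):

```cpp
#include <cstdint>

// The array holds a random cycle of indices, so each load address depends
// on the previously loaded value and cannot be speculated ahead.
uint64_t ChaseReads(const uint64_t* array, uint64_t start, uint64_t count) {
  uint64_t pos = start;
  for (uint64_t i = 0; i < count; i++)
    pos = array[pos];  // dependent load
  return pos;          // returning the result keeps the loop from being elided
}
```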
Writing data persistently is different: the cache must be circumvented or explicitly written back, and an sfence has to be used to wait for the data to reach PMem. This process is described in more detail in Sect. 3.1. To measure the latency of persistent store operations on PMem, we use a single thread that persistently stores data to an array of \({10}\hbox { GB}\) in size. Each store is aligned to a cache-line (\({64}\,{\text {bytes}}\)) boundary. The results are shown in Fig. 2.
There are several instructions for persisting data: flush, flushopt, clwb, and non-temporal stores (_mm512_stream_si512). A special case is clwb, which is intended to write a cache line back without evicting it. Intel has added the opcode to allow software to use it already, but implements it as flushopt for now. Therefore, non-temporal operations and clwb should be preferred over flush and flushopt.
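Expressed as intrinsics, the four variants look as follows (a sketch; the clwb and flushopt intrinsics require the corresponding compiler flags, e.g., -mclwb and -mclflushopt with GCC/Clang):

```cpp
#include <immintrin.h>

// flush: writes the line back and invalidates it.
void persist_flush(void* p)    { _mm_clflush(p);    _mm_sfence(); }
// flushopt: like flush, but weakly ordered, allowing back-to-back flushes.
void persist_flushopt(void* p) { _mm_clflushopt(p); _mm_sfence(); }
// clwb: writes the line back; intended to keep it in the cache.
void persist_clwb(void* p)     { _mm_clwb(p);       _mm_sfence(); }
// Non-temporal 64-byte store: bypasses the cache entirely
// (dst must be 64-byte aligned).
void persist_nt(void* dst, __m512i value) {
  _mm512_stream_si512(static_cast<__m512i*>(dst), value);
  _mm_sfence();
}
```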
2.3 Bandwidth
 | Peak read BW (GB/s) | Required #threads | Peak write BW (GB/s) | Required #threads
---|---|---|---|---
DRAM | 113.8 | 15 | 92.5 | 17
PMem | 39.1 | 17 | 12.5 | 3
We compare three ways of writing data: regular stores, regular stores followed by a clwb instruction, and blind writes realized by a non-temporal (or streaming) store (i.e., _mm512_stream_si512). For both DRAM and PMem, the blind stores provide the best throughput because the modified cache lines do not have to be loaded first, thereby saving memory bandwidth. On PMem, however, there is an additional benefit when using non-temporal stores, as they bypass the cache and force the data to the PMem DIMM directly.

Consider the effect of clwb in Fig. 4: On DRAM, the extra instruction only adds CPU overhead to a very tight loop and thus causes a slowdown compared to regular stores. With an increasing number of threads, this overhead no longer impacts the overall throughput, as the bottleneck shifts from CPU-bound to memory-bound. On PMem, in contrast, the performance of regular stores can be increased by issuing a clwb instruction after each store. By forcing the cache lines to PMem right after they are modified, we can ensure that lines that are modified together also arrive at the PMem DIMMs together and can thus be written together by the write-combining buffers. In other words, by using the clwb instruction, we preserve the spatial locality of the writes on their way from the CPU to the PMem DIMMs. The clwb instruction becomes more important with several threads than with a single one, because cache lines are evicted more randomly from the last-level CPU cache and thus arrive increasingly out of order at the PMem write-combining buffer. Starting at 12 threads for 256-byte chunks, regular stores followed by a clwb become as fast as non-temporal stores. However, this is likely due to the performance drop experienced by the non-temporal stores caused by the over-saturation of PMem.

Compared to DRAM, where there is only a difference between blind writes and regular ones, on PMem it additionally matters whether we ensure spatial locality of modified cache lines at the PMem DIMM or not. Thus, on PMem we end up with three discrete optimal throughput numbers (when considering the peaks): for regular stores, regular stores followed by a clwb instruction, and non-temporal stores. While there is a minor CPU overhead for using clwb, our experiments do suggest that the potential bandwidth benefit is worth it. From these experiments, we derive the following guidelines:
- Algorithms should no longer be designed to fit data on single cache lines (\({64}\,{\text {bytes}}\)) but rather cluster data on PMem blocks (\({256}\,{\text {bytes}}\)).
- When possible, non-temporal stores should be utilized; otherwise, regular stores should be followed by a clwb instruction (see the sketch after this list).
- Over-saturating PMem can lead to reduced performance with as little as four threads.
- The experiments showed that currently the PMem read bandwidth is \(2.9\times \) lower and the write bandwidth \(7.4\times \) lower than DRAM's. Therefore, performance-critical code should prefer DRAM over PMem (e.g., by buffering writes in a DRAM cache).
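The two recommended write paths can be sketched as follows (assuming 64-byte-aligned buffers and sizes that are multiples of 64 bytes):

```cpp
#include <cstring>
#include <immintrin.h>

// Preferred: blind, cache-bypassing 64-byte streaming stores.
void WriteNonTemporal(char* dst, const char* src, size_t n) {
  for (size_t i = 0; i < n; i += 64) {
    __m512i v = _mm512_loadu_si512(src + i);
    _mm512_stream_si512(reinterpret_cast<__m512i*>(dst + i), v);
  }
  _mm_sfence();  // wait until all streaming stores have left the CPU
}

// Fallback: regular stores, each cache line written back right away to
// preserve spatial locality at the PMem DIMM.
void WriteWithClwb(char* dst, const char* src, size_t n) {
  for (size_t i = 0; i < n; i += 64) {
    std::memcpy(dst + i, src + i, 64);
    _mm_clwb(dst + i);
  }
  _mm_sfence();
}
```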
2.4 Interference
3 Building blocks for PMem
3.1 Failure atomicity
Cache lines can reach PMem at any time and in any order, even without explicit write-back (clwb) or flush CPU instructions (flush or flushopt). This implies that any persistent data structure on PMem always needs to be in a consistent (or recoverable) state, as any modification to the structure could become persistent immediately. Otherwise, a system crash interrupting an update operation could lead to an inconsistent state after a restart. The following code snippet shows how an element is appended to a pre-allocated buffer.
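The original snippet was not preserved in this version; the following is a hypothetical reconstruction consistent with the line references below:

```cpp
#include <cstdint>
#include <immintrin.h>

struct Buffer {
  uint64_t next;        // number of valid elements
  uint64_t data[1024];  // pre-allocated element slots
};

void Append(Buffer* buf, uint64_t value) {
  buf->data[buf->next] = value;     // store the new element (line 2)
  _mm_clwb(&buf->data[buf->next]);  // write the element back to PMem (line 3)
  _mm_sfence();                     // wait for it; prevents re-ordering (line 5)
  buf->next++;                      // make the element visible (line 6)
  _mm_clwb(&buf->next);             // persist next in the same fashion (line 7)
  _mm_sfence();                     // (line 8)
}
```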
To force the element to PMem, clwb (cache line write back) is used, an optimized flush operation that writes the cache line back without invalidating it. Before the buffer's size indicator (next) can be changed, an sfence (store fence) must be issued to prevent re-ordering by the compiler or hardware (line 5). Once next has been written (line 6), it is persisted to memory in the same fashion (lines 7, 8). Note that persisting the next field is not necessary for the failure atomicity of a single append operation. However, it is convenient and often required for subsequent code (e.g., another append). Hereafter, we will use the terms persistency barrier and persist for a combination of a clwb and a subsequent sfence.
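As an illustration, such a persist barrier over an address range might look as follows (a hypothetical helper, not the paper's exact code):

```cpp
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Persistency barrier: write back every cache line of [addr, addr+len),
// then fence once (64-byte cache lines assumed).
inline void Persist(const void* addr, size_t len) {
  auto p = reinterpret_cast<uintptr_t>(addr) & ~uintptr_t{63};
  auto end = reinterpret_cast<uintptr_t>(addr) + len;
  for (; p < end; p += 64)
    _mm_clwb(reinterpret_cast<void*>(p));
  _mm_sfence();
}
```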
3.2 Page propagation
Failure-atomic propagation of whole pages is also provided by Intel's libpmemblk library.
3.2.1 Failure-atomic page propagation
An alternative to failure-atomic page propagation is to detect partially written (torn) pages during recovery using checksums. Cryptographic hash functions are expensive for this purpose (the throughput of fast SHA-256 implementations is reported at roughly \({3.5}\hbox { GB/s}\) per core). In our implementation we use CRC32, which is supported directly by modern CPUs (_mm_crc32_u64) and works almost at line rate (we measured a throughput of \({10.3}\hbox { GB/s}\)). While CRC32 does not provide equally good collision resistance, it models the best-case scenario for checksum-based page propagation as it incurs the lowest overhead.
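The checksum computation itself can be sketched as a small helper around the _mm_crc32_u64 intrinsic (a hypothetical implementation):

```cpp
#include <cstddef>
#include <cstdint>
#include <nmmintrin.h>  // SSE4.2: _mm_crc32_u64

// Checksum a page, 8 bytes at a time, at close to line rate.
uint64_t PageChecksum(const uint64_t* page, size_t words) {
  uint64_t crc = ~uint64_t{0};  // seed
  for (size_t i = 0; i < words; i++)
    crc = _mm_crc32_u64(crc, page[i]);
  return crc;
}
```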
However, our experimental results, even for CRC32, showed no performance advantage compared to the failure-atomic copy-on-write implementation. Therefore, we argue that the additional system complexity, recovery time, and storage overhead (for snapshots) are not worth it, and failure-atomic page propagation should be preferred.
3.2.2 Copy-on-write
Listing 1 shows the pseudo code for propagating a volatile page (vp) to a used persistent page (pp). Once the volatile page (vp) is written (line 2) and persisted using a persistency barrier (line 3), it is marked as valid (lines 7-11) and the old PMem page can be reused. During recovery, the headers of all PMem pages are inspected to determine the physical location of each logical page. To avoid invalidating unused pages before they can be written again, we use a per-page monotonically increasing page version number (pvn) to determine the latest version of the page on disk.
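Listing 1 itself was not preserved here; the following hypothetical reconstruction is consistent with the line references in the surrounding text and reuses the Persist helper sketched in Sect. 3.1 (Page and kPageSize are illustrative):

```cpp
#include <cstdint>
#include <cstring>
#include <immintrin.h>

constexpr size_t kPageSize = 16 * 1024;
// pid and pvn share one cache line so their update order can be controlled.
struct Page { uint64_t pid, pvn; char payload[kPageSize]; };

void PropagatePage(const Page* vp, Page* pp) {
  std::memcpy(pp->payload, vp->payload, kPageSize);  // write payload (line 2)
  Persist(pp->payload, kPageSize);                   // barrier (line 3)
  // Mark the slot valid (lines 7-11): pid strictly before pvn.
  pp->pid = vp->pid;                 // page id first ...
  _mm_sfence();                      // ... ordered by an sfence ...
  pp->pvn = vp->pvn;                 // ... then the page version number
  Persist(pp, 2 * sizeof(uint64_t)); // final barrier (line 10)
}
```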
The figure shows two page slots holding pages B and A, respectively. Both slots are shaded in green to indicate that they currently hold a valid page. One page slot contains an older version of B, which can be identified by inspecting the pvn: a lower pvn indicates an older page version. The different versions of the other page slot show each step (cf. Listing 1) of flushing a new version of page A to this currently (state 1) unused slot. The line numbers in the pseudo code where the transition might occur are written over the arrows. In each step, the pvn can be used to determine the most recent version of each page.
In state 1, the slot contains an outdated version of page B and is therefore ready to be overwritten by a new version of page A. In state 2, the pseudo code has run until the persistency barrier in line 3 (persist). At this point, only the payload has been updated and the page would therefore still be identified as an old version of B. Next, this newly written version of page A has to be made valid by updating the pid and pvn. It is crucial that the pid is updated before the pvn; otherwise, there is a brief time window in which the updated pvn would identify the page as the latest version of page B despite it storing data of page A. We ensure this ordering by placing both (pid and pvn) on the same cache line and separating the two store operations with an sfence. This way, state 3 and state 4 are the only possible versions of the page during these updates. In each case, the page is correctly identified as an outdated version of page B or the new version of page A. The final persistency barrier in line 10 ensures that the update is completed before continuing. Note that if the page slot had started out with a higher pvn than the one of A (e.g., 10), it would already have been identified as the latest version of A in state 3 (which is fine because the payload has already been updated).

As an optimization, the payload can be written with non-temporal stores, so that line 3 only requires an sfence, which is much cheaper as it does not stall on a preceding clwb. Using this technique, we measured a \(\approx 10\%\) increase in throughput.
3.2.3 Micro log
3.2.4 Experiments
3.3 Logging
3.3.1 Algorithms
With RAWL, each 64-bit word of a log entry stores 63 bits of data plus a validity bit marking the word as written (validity_bit=1). Once the log file is full, it can be reused by flipping the validity bit.

Our bit-count algorithm instead counts the number of one-bits of the log entry (using the popcnt instruction). Next, the header, data, and bit count (cnt) are written to the log and persisted together. Using the bit count, it is always possible to determine the validity of a log entry: either the cache line containing the bit count was not flushed or it was. In the former case, the field contains the number zero (because the file was zeroed) and the entry is invalid. In the latter case, the bit count field can be used to determine whether all other cache lines belonging to the log entry have been flushed as well. Compared to RAWL, the code for writing and reading the log is less complex and only requires a logarithmic space overhead (the pop_count field) instead of a linear one (1 validity bit per 63 bits of log data).
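A hypothetical sketch of the bit-count scheme (the entry layout and names are assumptions; the log file is assumed to be zeroed before use):

```cpp
#include <cstdint>
#include <cstring>
#include <immintrin.h>  // _mm_clwb, _mm_sfence
#include <nmmintrin.h>  // _mm_popcnt_u64

struct LogEntry {
  uint64_t size;   // header: payload size in bytes
  uint64_t cnt;    // number of one-bits in header + payload
  uint8_t data[];  // payload
};

void AppendLogEntry(LogEntry* dst, const uint8_t* payload, uint64_t size) {
  dst->size = size;
  std::memcpy(dst->data, payload, size);
  // Count the one-bits of header and payload with popcnt.
  uint64_t cnt = _mm_popcnt_u64(size);
  for (uint64_t i = 0; i < size; i++)
    cnt += _mm_popcnt_u64(dst->data[i]);
  dst->cnt = cnt;  // recovery: entry valid iff this count matches (and != 0)
  // Write back all cache lines of the entry, then wait once.
  auto p = reinterpret_cast<uintptr_t>(dst) & ~uintptr_t{63};
  auto end = reinterpret_cast<uintptr_t>(dst->data + size);
  for (; p < end; p += 64)
    _mm_clwb(reinterpret_cast<void*>(p));
  _mm_sfence();
}
```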
3.3.2 Experiments
3.4 In-place updates
On current hardware, only 8-byte stores to PMem are failure atomic, so updates of up to \({8}\,{\text {bytes}}\) can be performed in place with a single persistency barrier (clwb and sfence). For larger in-place updates, as commonly used in any PMem-based data structure [2, 9, 15, 27, 48, 52], either copy-on-write or log-based techniques are used. Both techniques require at least two persistency barriers, thus slowing down the update throughput. The table below compares the three techniques for an update of \(n\) bytes (\(c\) denotes a constant log overhead): the space required, the cache lines written for 16-, 32-, and 64-byte updates, and the number of persists needed.
 | Required size (bytes) | #Cache lines written (\({16}{\text {B}}\)) | (\({32}{\text {B}}\)) | (\({64}{\text {B}}\)) | #Persists
---|---|---|---|---|---
CoW | \(2n + 1\) | 2 | 2 | 3 | 2
Log | \(n + c\) | 2 | 2 | 3 | 2
FAM | \(\lceil 8n/31\rceil \cdot 8\) | 1 | 2 | 3 | 1
3.4.1 CoW-based
The CoW-based approach keeps two in-place versions of the data, and a flag indicates which of the two buffers (a or b) is currently active. The memory consumption could be optimized by sharing the "unused" buffer over multiple CoW structures. However, this would incur an additional cache line miss (pointer chase). Additionally, moving the actual data behind a pointer (out of place) would sidestep the very problem we are trying to solve here: in-place updates. Therefore, the depicted algorithm keeps both versions in place; it could be used, e.g., on a single node in a tree-like data structure (thus avoiding memory allocation and reclamation issues and keeping a flat memory format that can easily be written to disk). The update process inherently requires two persistency barriers to avoid corruption in case of a crash, because the new data needs to be fully written before it can be marked valid. For both reading and writing, only one cache line has to be touched for \({16}\,{\text {bytes}}\) of data, as the sketch below illustrates.
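A minimal sketch of this double-buffered scheme (names and sizes are illustrative; the persist barrier is inlined as clwb + sfence):

```cpp
#include <cstdint>
#include <cstring>
#include <immintrin.h>

// Two in-place versions plus the validity flag fit in one cache line
// for 16-byte values.
struct CowValue {
  uint64_t buf[2][2];  // buffers a (index 0) and b (index 1)
  uint64_t active;     // which buffer currently holds the valid data
};

void CowUpdate(CowValue* v, const uint64_t (&data)[2]) {
  uint64_t inactive = 1 - v->active;
  std::memcpy(v->buf[inactive], data, sizeof(data));  // write new version
  _mm_clwb(v);
  _mm_sfence();          // barrier 1: new data durable before the switch
  v->active = inactive;  // atomic 8-byte flip of the validity flag
  _mm_clwb(v);
  _mm_sfence();          // barrier 2: validity switch durable
}

// Reads touch the same single cache line.
const uint64_t* CowRead(const CowValue* v) { return v->buf[v->active]; }
```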
3.4.2 Log-based
3.4.3 Failure-atomic memory (FAM)
FAM divides the user data into 4-byte blocks and stores each of them in an 8-byte failure-atomic memory block (FAMB) that holds \({31}\,{\text {bits}}\) of user data (new), a backup of the previous version (old), and a \({2}\,{\text {bit}}\) version number. To update a FAMB, its version number is incremented, the current user data (new) is copied to a backup location (old) (line 4), and the new user data is written (line 5). Once the FAMB is updated, it is written back to memory (line 6). This process is performed for each 4-byte block of the user data. Because FAMBs only store \({31}\,{\text {bits}}\) of user data, the most significant bits of each 4-byte input block are extracted (lines 13-14) and stored in an additional fifth FAMB (line 16).

The figure illustrates an update of \({16}\,{\text {bytes}}\): the four blocks toward the top show the user data (as passed to the Update() call) and the five blocks toward the bottom show the five FAMBs. Our algorithm only ensures that no intermediate state of a single FAMB is leaked to PMem; individual FAMBs of one FAM may, however, be written back before others. In case of a crash before everything is committed to PMem (persist), the program can inspect the \({2}\,{\text {bit}}\) version number during recovery: if the version numbers of all FAMBs match, the FAM is in a consistent state (either old or new). Otherwise, only some FAMBs have been persisted and need to be rolled back. The version number (\({2}\,{\text {bit}}\)) provides 4 states (0, 1, 2, and 3) and increments may trigger overflows (inc(3) = 0), making it possible to determine which FAMB has the more advanced version and needs to be rolled back. A rollback requires the version number to be decremented (with underflow: dec(0) = 3) and the recovery of the old version (l.new = l.old). In case of repeated crashes, a single FAMB is only rolled back once, because during subsequent recoveries the version number already matches the other FAMBs. Hence, the recovery of FAM is idempotent and guarantees progress, as rollback actions do not need to be repeated.
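A hypothetical sketch of a single FAMB update, assuming the 8-byte layout implied above (31-bit new, 31-bit old, 2-bit version; names are illustrative):

```cpp
#include <cstdint>
#include <immintrin.h>

struct FAMB {
  uint64_t new_ : 31;  // current user data
  uint64_t old_ : 31;  // backup of the previous version
  uint64_t version : 2;
};
static_assert(sizeof(FAMB) == 8, "a FAMB must be one atomic 8-byte word");

void UpdateFAMB(FAMB* famb, uint32_t value31) {
  FAMB l = *famb;             // build the new state locally
  l.version = l.version + 1;  // bump version; wraps around: inc(3) = 0
  l.old_ = l.new_;            // back up the current user data (line 4)
  l.new_ = value31;           // write the new user data (line 5)
  *famb = l;                  // one 8-byte store, hence atomic (line 6)
  _mm_clwb(famb);             // schedule the write back to PMem
}
// One sfence after updating all FAMBs of a FAM persists them together.
```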
3.4.4 Experiments
All approaches force their modifications to PMem using cache-line write backs (clwb). The logging-based approach is assigned a sufficiently large log file for the workload, such that it never has to be re-initialized. The figure shows performance for a sequential access pattern (a, b, c) and a random one (d, e, f). The throughput of writes (a, d), reads (b, e), and dependent reads (c, f) is depicted from left to right. For dependent reads, out-of-order execution is prevented by making each read location dependent on the previously read value.
3.5 Coroutines
3.5.1 FP-tree implementation
Each leaf node consists of a bit mask (used) indicating which slots are filled, an array of fingerprints (fps) that stores a 1-byte hash of each key, and an array with key-value pairs (kvs). We use these leaf nodes to measure the effect of interleaving lookups as well as interleaving inserts. A sketch of this layout is shown below.
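A minimal sketch of the leaf layout (the slot count and the key/value types are assumptions):

```cpp
#include <array>
#include <bitset>
#include <cstdint>
#include <utility>

constexpr int kSlots = 48;  // illustrative capacity
struct LeafNode {
  std::bitset<kSlots> used;                                // filled slots
  std::array<uint8_t, kSlots> fps;                         // 1-byte fingerprints
  std::array<std::pair<uint64_t, uint64_t>, kSlots> kvs;   // key-value pairs
};
```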
3.5.2 Lookup implementation
The lookup first issues a prefetch (_mm_prefetch) instruction to get the requested cache line from the underlying memory (DRAM or PMem). Instead of waiting for the cache line to be loaded, we use co_await to return the control flow to the caller. The caller can then continue execution by resuming the next active coroutine or starting a new one. This way, any number of lookups can be executed in an interleaved fashion: while one is waiting for memory to be loaded, another one can progress.
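A minimal C++20 sketch of this prefetch-and-suspend pattern (Task and Prefetch are illustrative, not the paper's actual framework; LeafNode refers to the sketch above):

```cpp
#include <coroutine>
#include <cstdint>
#include <immintrin.h>

// Minimal task type so a lookup can suspend and be resumed by the caller.
struct Task {
  struct promise_type {
    Task get_return_object() {
      return Task{std::coroutine_handle<promise_type>::from_promise(*this)};
    }
    std::suspend_never initial_suspend() noexcept { return {}; }
    std::suspend_always final_suspend() noexcept { return {}; }
    void return_void() {}
    void unhandled_exception() {}
  };
  std::coroutine_handle<promise_type> handle;
};

// Awaitable: issue the prefetch, then return control to the caller.
struct Prefetch {
  const void* addr;
  bool await_ready() const noexcept { return false; }
  void await_suspend(std::coroutine_handle<>) const noexcept {
    _mm_prefetch(static_cast<const char*>(addr), _MM_HINT_T0);
  }
  void await_resume() const noexcept {}
};

// The caller starts several lookups and round-robins handle.resume()
// over the suspended ones, overlapping their memory stalls.
Task Lookup(const LeafNode* leaf, uint64_t key) {
  co_await Prefetch{leaf};  // line is being fetched while others run
  // ... probe leaf->fps and leaf->kvs for `key` once the line is cached ...
  (void)key;
}
```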
3.5.3 Insert implementation
For an insert, the validity information is the corresponding bit in the used bit mask, and the data is the key-value pair (kvs) and the key's fingerprint (fps). We can interleave multiple writes by issuing the store and cache-line write-back (clwb) instructions normally and then using one store fence (sfence) for a whole group of inserts to force the data to PMem. Hence, the algorithm only has to wait once for the completion of all cache evictions, but each individual insert operation still has the guarantee that its data was persisted before it continues. Figure 11 illustrates three inserts with individual fences (top) and shared ones (bottom); a sketch of the shared-fence variant follows.
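A sketch of the shared-fence pattern, reusing the illustrative LeafNode from above (the batch representation and slot assignment are assumptions):

```cpp
#include <cstdint>
#include <immintrin.h>
#include <span>

struct Entry { uint64_t key, value; uint8_t fp; int slot; };

// Each insert issues its stores and cache-line write backs individually,
// but the whole group shares a single sfence (cf. Fig. 11, bottom).
void InsertGroup(LeafNode* leaf, std::span<const Entry> batch) {
  for (const Entry& e : batch) {
    leaf->kvs[e.slot] = {e.key, e.value};
    leaf->fps[e.slot] = e.fp;
    _mm_clwb(&leaf->kvs[e.slot]);  // write back the key-value pair
    _mm_clwb(&leaf->fps[e.slot]);  // write back the fingerprint
    leaf->used.set(e.slot);
    _mm_clwb(&leaf->used);         // write back the validity bit mask
  }
  _mm_sfence();  // wait once for all evictions of the whole group
}
```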
3.5.4 Experiments
We attribute this effect to the interplay of the cache-line write back (clwb) and store fence (sfence) instructions: the sfence instruction allows for reordering of loads and only "fences" stores. To test this hypothesis, we used a memory fence (mfence) instruction instead of the sfence. The mfence does not allow for any reordering. In this scenario, the performance of inserts with interleaved reads becomes similar to that of inserts with interleaved writes.
4 Related work
5 Conclusion
(3) Writing data back explicitly (clwb or non-temporal stores) is essential for a high write bandwidth. (4) When using PMem and DRAM at the same time, there are interference effects that cause significant slowdowns.