1 Introduction and Motivation
- From the literature (§3), we identify solutions that already utilize multiple storage layers and adapt existing insights for an event store.
- We introduce PMem as a third layer into ChronicleDB's layout. We show various opportunities and limitations for using PMem in the respective components of an event store, such as ingestion, storage design, recovery, and index maintenance (§4).
- With the help of micro-benchmarks, we substantiate that our approaches can improve the general query performance, the recovery speed and guarantees, as well as flatten fluctuations in ingestion speed (§5).
- We summarize our findings and formulate future research directions (§6).
2 Background
| Component | Configuration |
|---|---|
| Processor | 2× Intel® Xeon® Gold 5215, 10 cores / 20 threads each, max. 3.4 GHz |
| Caches | 32 KB L1d/L1i, 1024 KB L2, 13.75 MB LLC |
| Memory | 2 × 6 × 32 GB DDR4 (2666 MT/s), 2 × 6 × 128 GB Intel® Optane™ DCPMM |
| Storage | 1 TB Intel® SSD DC P4501 Series |
| OS & compiler | CentOS 7.8, Linux 5.6.11 kernel, OpenJDK 14.0.1 |
2.1 Persistent Memory
| | DRAM | DCPMM | TLC Flash |
|---|---|---|---|
| Idle seq. read latency | 81 ns | 174 ns | 14 µs |
| Idle rand. read latency | 88 ns | 325 ns | 206 µs |
| Max. read bandwidth | 85 GB/s | 32 GB/s | 3 GB/s |
| Max. write bandwidth | 46 GB/s | 13 GB/s | 0.6 GB/s |
| Random reads | 931 M/s | 45 M/s | 299 K/s |
| Random writes | 703 M/s | 30 M/s | 61 K/s |
2.2 The Event Store ChronicleDB
3 Related Work
3.1 Multi-Level Data Structures
3.2 PMem-aware Storage Engines
4 Concepts and Approaches
4.1 \(\text{TAB}^{+}\)-Tree
4.2 Storage Layout
4.3 Out-of-Order Data
5 Micro-Benchmarks
5.1 Experimental Setup
The PMem modules are exposed through a file system mounted with the DAX option. The operating mode of the modules is set to App Direct. From Java, PMem is accessed via the ByteBuffer interface. We configured a tree-node size of 8 KiB and used the LZ4 algorithm to compress the nodes. Furthermore, each node maintains a spare space of 10% for absorbing OOO events. The block size of the storage layer was set to 32 KiB.

We use two data streams. The first stream (Stock) was taken from [39]. It simulates a stock ticker with four attributes: sequence number, symbol, price, and volume. Including the timestamp, events are 28 bytes in size. If not stated otherwise, this stream was used in the experiments. The second stream (Sine) comprises events with six 64-bit floating-point attributes. Thus, the size of an event is 56 bytes including the 8-byte timestamp value. For the \(i^{\text{th}}\) event, the corresponding attribute values are generated as \(\sin(\frac{i \bmod 1M}{1M}\cdot 2\pi)\). This allows us to control the selectivity of filter conditions on secondary attributes. By default, both data sources use increasing timestamps (i.e., \(e_{i}.t=i\)) and consist of 100M events.
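To illustrate how a file on a DAX-mounted PMem file system can be reached through Java's ByteBuffer interface, the following minimal sketch maps such a file and persists a single write. The mount point, file name, and region size are hypothetical and the code is not taken from ChronicleDB.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: map a file on a DAX-mounted PMem file system into the address space
// and access it through the ByteBuffer interface.
public final class PmemBufferExample {
    public static void main(String[] args) throws IOException {
        Path file = Path.of("/mnt/pmem/chronicle.dat"); // hypothetical mount point
        long size = 32 * 1024;                          // one 32 KiB storage block

        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // On JDK 14+, ExtendedMapMode.READ_WRITE_SYNC (module jdk.nio.mapmode)
            // could be used instead to obtain a synchronous mapping on DAX.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, size);

            buf.putLong(0, System.nanoTime()); // write an 8-byte timestamp
            buf.force();                       // flush the mapped range to PMem
        }
    }
}
```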
5.2 \(\text{TAB}^{+}\)-Tree

To quantify the overhead of flushing to PMem, we measured the wall-clock time for inserting Stock. The results are depicted in Fig. 5 for increasing flush batch sizes. When flushing every event, the insertion rate drops to approx. 50% of the DRAM implementation while still allowing for a respectable 2.1M insertions per second. However, with a batch size of 25 events, we achieve 95% of DRAM performance and reduce data loss by at least 91% (losing less than 25 instead of 291 events). Additionally, PMem reduced the recovery time after a system failure from 40 ms to below 1 ms for a tree of height 3.
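The flush-batching trade-off described above can be sketched as follows; this is a minimal illustration assuming events are staged in a PMem-backed MappedByteBuffer, with class name, layout, and batch size chosen for exposition rather than taken from ChronicleDB.

```java
import java.nio.MappedByteBuffer;

// Sketch: persist appended events in batches instead of after every event.
// A crash between two flushes loses at most `batchSize` events.
final class BatchedFlusher {
    private final MappedByteBuffer buffer; // PMem-backed mapping (see above)
    private final int batchSize;           // e.g. 25 events, as in the experiment
    private int pending = 0;

    BatchedFlusher(MappedByteBuffer buffer, int batchSize) {
        this.buffer = buffer;
        this.batchSize = batchSize;
    }

    void append(byte[] event) {
        buffer.put(event);                 // stage the event in the mapped region
        if (++pending >= batchSize) {
            buffer.force();                // flush the batch to PMem
            pending = 0;
        }
    }
}
```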
Next, we insert Sine into the \(\text{TAB}^{+}\)-Tree and compare the insertion wall-clock time as well as the average response time for a variety of queries. The results are summarized in Fig. 6, with the original ChronicleDB serving as a baseline. Insert performance is barely affected by any of the approaches. This is expected, since writing leaf nodes is the dominant cost factor of insertion. Furthermore, aligning aggregates on cache-line boundaries had no visible effect on query performance. This could be explained by the fact that not all of the PMem bandwidth is utilized in this setup. For application time point queries, the higher fan-out achieved when storing aggregates on PMem reduces execution time by approx. 15%. However, for temporal aggregation queries covering a variety of time ranges, the double access (index node + aggregate) introduces a performance penalty of 10% to 25%. Finally, when using lightweight indexing for filter queries on a secondary attribute with a selectivity of 0.1%, the benefit of a higher fan-out is partly eliminated by the extra access to PMem to read aggregates (approx. 15% improvement). In contrast, storing entire index nodes on PMem results in a reasonable performance boost for all query types (35%–40%), because index navigation and aggregate access do not incur reads on secondary storage.

5.3 Address Translation
Besides the persistent variants, a purely volatile address translation table (DRAM) is used in our comparison. Fig. 7 shows the average time of a sequential update, a random lookup, and the total recovery time. Each of these measurements is based on the wall-clock time of 10M operations. Even though DRAM exhibits excellent update and lookup performance, it requires a full file scan upon recovery, making it infeasible for production use cases. Compared to Flash, the PMem solution offers superior lookup and recovery times. However, sequential updates on Flash are 5x faster compared to PMem. The reason for this is that the flash-based ATT maintains its right flank in main memory. In summary, even though the very small update/lookup times are hardly noticeable outside of isolated benchmarks, PMem can bridge the gap between significant Flash recovery times and DRAM access times while reducing implementation complexity due to its flat-array structure.
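The flat-array structure mentioned above can be illustrated with a short sketch in which logical block ids index directly into a PMem-backed mapping. The entry layout and class name are assumptions for exposition, not ChronicleDB's actual ATT implementation.

```java
import java.nio.MappedByteBuffer;

// Sketch: a flat-array address translation table (ATT) kept on PMem.
// Logical block id i maps to an 8-byte physical address stored at offset 8*i.
final class PmemAddressTable {
    private static final int ENTRY_SIZE = Long.BYTES;
    private final MappedByteBuffer table; // PMem-backed mapping

    PmemAddressTable(MappedByteBuffer table) {
        this.table = table;
    }

    // Sequential update: persist the new physical address of a logical block.
    void update(int logicalBlock, long physicalAddress) {
        table.putLong(logicalBlock * ENTRY_SIZE, physicalAddress);
        table.force(logicalBlock * ENTRY_SIZE, ENTRY_SIZE); // flush only this entry (JDK 13+)
    }

    // Random lookup: a single 8-byte read, no I/O on secondary storage.
    long lookup(int logicalBlock) {
        return table.getLong(logicalBlock * ENTRY_SIZE);
    }
}
```

Recovery then amounts to re-opening the mapped file; no scan or rebuild of the table is required.

5.4 Out-of-Order Handling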
We compare a volatile DRAM red-black application time tree with persistent solutions (PMem, Flash with 8-KiB pages). For the latter, we differentiate between a pure solution and one with an additional in-memory application time index (Indexed). The following set of experiments is based on a queue size of 100 MiB (approx. 3.5M events).
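The Indexed variant can be sketched as a volatile application-time index over the persistent OOO queue; the class below uses a TreeMap (a red-black tree in the JDK) mapping timestamps to log offsets. Names and layout are illustrative assumptions rather than the actual implementation.

```java
import java.util.TreeMap;

// Sketch of the "Indexed" variant: a volatile application-time index over the
// persistent OOO queue. Timestamps map to the offset of the event in the log.
final class ApplicationTimeIndex {
    private final TreeMap<Long, Long> index = new TreeMap<>(); // timestamp -> log offset

    // Called for every OOO event appended to the persistent queue.
    void onAppend(long timestamp, long logOffset) {
        index.put(timestamp, logOffset);
    }

    // Point query: returns the log offset for `timestamp`, or null on a MISS.
    Long lookup(long timestamp) {
        return index.get(timestamp);
    }

    // Recovery: rebuild the volatile index by scanning the persistent log once.
    void rebuild(Iterable<long[]> logEntries) { // each entry: {timestamp, offset}
        index.clear();
        for (long[] e : logEntries) {
            index.put(e[0], e[1]);
        }
    }
}
```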
During insertion, the PMem solutions cannot compete with DRAM, because every event is flushed directly. However, they outperform both Flash variants by 3\(\times\) (pure) and 2\(\times\) (Indexed), respectively. During recovery, only the persistent Indexed variants have to perform work besides opening the log. In this case, PMem and Flash exhibit similar performance, because the recovery time is dominated by re-building the index, while the cost of scanning the data is negligible. For query performance, we executed application time point queries and measured the average query time, distinguishing between HIT (query has a result) and MISS (query has no result) queries. For MISS-queries, all Indexed variants exhibit the same performance, since only the index has to be considered. However, for HIT-queries, Flash Indexed suffers from reading a full page from the log for each query. In contrast, PMem Indexed can directly access the corresponding event and thus is as quick as DRAM.
Finally, we evaluate the impact of OOO data on ingestion using the Stock data source. In particular, for an OOO fraction of x%, \((100-x)\)% of the events are inserted in application time order. Of the remaining x% of events, 10% are distributed uniformly at random over the application time span. The other 90% are equally distributed over 10,000 equidistant, temporally close batches. This workload represents occasional OOO data coupled with short bursts. Fig. 9 shows the total processing time in seconds for inserting data sets with 1%, 5%, and 10% OOO fractions into ChronicleDB. Due to the limited queue capacity of 100 MiB, queue merges impact the write performance. As a result of unsorted data, the non-indexed variants cannot take advantage of bulk-merging. Flash Indexed performs worst for 5% and 10%, because reading data in application-time order from the log incurs random access to pages. As expected, PMem Indexed does not suffer from this drawback and thus achieves the best of both worlds.