1 Introduction
1.1 Intermediate representation levels
1.2 Contributions
1.3 Outline
2 Query translation
2.1 Query plan to IR
Each operator implements produce or consume methods. We call these methods operator emitters. Figure 2 illustrates the operator emitters that are executed during translation of a sample query. The build pipeline on the left of the join populates the join hash table. It was translated previously. The probe pipeline, surrounded by the dotted line, accesses the join hash table. We look at its translation in detail. Translation starts with produce(...) on the scan of R, which contains a loop that iterates over the table R and reads its tuples. The code for selection was emitted by \(\sigma \).consume(...), and now the hash join follows with \(\bowtie \).consume(...). The implementation of the method is shown in Fig. 3, which uses a notation similar to Kersten et al. [15]. Code lines following an emit statement are underlined to emphasize that this code is not executed immediately but instead placed in the JIT query. For instance, createHashtable(..) is not underlined (line 2) and is therefore executed during translation. By contrast, ht_ins(..) is underlined (line 3) and is therefore placed in the compiled code. This leads to repeated execution of the line for every tuple of the scanned table.
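The distinction between translation-time and emitted code can be sketched as follows. This is a minimal illustration, not ReSQL's actual implementation; the names Emitter and translate_build_side are hypothetical. Plain statements run exactly once during translation, while emit(...) appends a line to the JIT query, where it later runs for every tuple.

```python
class Emitter:
    def __init__(self):
        self.jit_program = []          # lines of the generated query code

    def emit(self, line):
        self.jit_program.append(line)  # deferred: runs later, per tuple

def translate_build_side(em):
    # createHashtable(..): not emitted, runs immediately and exactly once
    ht = {"name": "ht0"}
    em.emit("loop_head0:")
    em.emit("  mov {r_a}, [{scanLoc}]")
    # ht_ins(..): emitted, so it is placed inside the scan loop of the
    # compiled query and executes once per tuple
    em.emit(f"  call {ht['name']}_ins({{r_a}})")
    em.emit("  jmp loop_head0")
    return ht

em = Emitter()
ht = translate_build_side(em)
```

Running the translator produces the four-line loop once; the per-tuple cost lives entirely in the emitted program, not in the translator.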
\(\bowtie \).consume(...) is called from its right child, and therefore the probe-side code is produced (lines 7–13). The code first initializes the variable \( entry \), which holds hash probe results (line 7), and then loops over the hash join matches (lines 8–13). In the loop, we first call ht_get(...) to retrieve the next match (line 9) and then perform a check to exit when no more matches exist (lines 10–11). To process join matches, we read the attributes of the match into registers (line 12), and then the join's parent operators place their code by calling consume(...) (line 13).

Figure 4 shows the resulting Flounder IR. Virtual registers hold the attributes {r_a}, {s_a}, and {s_b} and the locations of hash table entries in {entry}. The ht_get(...) call is realized with a call pseudo-instruction, and the loop over the probe matches with a combination of a compare and two jumps. To read attributes from a hash table entry (dematerialize), we use mov instructions that read from a memory location, denoted in brackets [], into a virtual register, e.g., {s_a}.
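The runtime behavior that this emitted probe code implements can be sketched in plain Python. The sketch is illustrative: ht_get and entry follow the text, but the chaining scheme (an iterator over a bucket's matches) is an assumption about how repeated ht_get calls return successive matches.

```python
from collections import defaultdict

def build(ht, key, payload):
    ht[key].append(payload)           # build side: ht_ins per tuple

def probe(ht, key):
    """Mimics the emitted probe loop of Fig. 3 (lines 7-13)."""
    matches = []
    entry = None                      # line 7: initialize {entry}
    chain = iter(ht.get(key, []))     # state ht_get would keep per probe
    while True:
        entry = next(chain, None)     # line 9: ht_get(...) fetches a match
        if entry is None:             # lines 10-11: exit, no more matches
            break
        s_a, s_b = entry              # line 12: dematerialize attributes
        matches.append((key, s_a, s_b))  # line 13: parent consume(...)
    return matches

ht = defaultdict(list)
build(ht, 42, (1, 2))
build(ht, 42, (3, 4))
result = probe(ht, 42)
```

Probing a key with two matches yields both join tuples; probing an absent key exits the loop immediately.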
2.2 IR to machine code
The next step translates Flounder IR to x86_64 machine code [22]. Virtual registers such as {s_a} are replaced with machine registers or stack locations. Further, the machine assembly uses additional mov instructions to transfer values between registers and the stack. The translation process from Flounder IR to machine code needs to manage machine resources such as registers and stack memory and find an efficient way to use them during JIT query execution.

2.3 ReSQL translation mechanisms
We use the expression 10.0 + 0.34, a sum of two decimal constants, as an example. ReSQL uses 64 bit integers for decimal arithmetics and thus represents the values as 100 and 34, along with a base and a precision. The precision is the number of digits in total and the base is the number of digits following the decimal point.
The types decimal(3,1) and decimal(3,2) are given by the constants. The expression translator applies type rules to derive the typed expression tree shown in Fig. 5. One typecast was inserted to maintain the same base for the add. Then, the second step emits Flounder IR for the expression tree. Starting with the leaf expressions, code for the evaluation of each node is emitted. The resulting Flounder IR to evaluate the expression is shown in Fig. 6. The code uses pseudo-instructions to indicate the validity range of each virtual register, e.g., {x}. First, both constants are loaded. Then, the typecast for {dec_const1} is evaluated by multiplying with 10. Finally, the add is evaluated and the result is stored in {add_res0}. The IR code is inserted into the code frame of the query and translated to machine code along with the query.
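The decimal arithmetic described above can be worked through directly. Following the text's representation, a value is a 64 bit integer plus a precision (total digits) and a base (digits after the decimal point): 10.0 is stored as 100 with base 1, and 0.34 as 34 with base 2. The function names below are illustrative, not ReSQL's API.

```python
def cast_to_base(value, base, target_base):
    # the inserted typecast: align bases by multiplying with a power of ten
    assert target_base >= base
    return value * 10 ** (target_base - base)

def add_decimals(a, a_base, b, b_base):
    # bring both operands to the larger base, then add the raw integers
    target = max(a_base, b_base)
    total = cast_to_base(a, a_base, target) + cast_to_base(b, b_base, target)
    return total, target

# 10.0 (stored as 100, base 1) + 0.34 (stored as 34, base 2)
result, base = add_decimals(100, 1, 34, 2)
```

The cast multiplies 100 by 10 to reach base 2, and the add yields 1034 with base 2, i.e., the decimal value 10.34.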
ReSQL provides helper functions in the Values namespace for this purpose. These are shown in Fig. 7. To evaluate the projection expressions from a select-clause, for example, we use tup=Values::evaluate(projs). The result tup is a list of virtual registers that hold the expression results, ultimately a tuple. Similarly, lists of virtual registers are used to hold tuples after scanning them or when applying a hash function.

3 Lightweight abstractions
Flounder IR is closely related to x86_64 assembly, but it adds several lightweight abstractions. The abstractions are designed with the interface to the query compiler and with the resulting machine code in mind. In this way, Flounder IR passes just the right set of information into the compilation process. For operator emitters, the IR provides independence of machine-level concepts, which allows similar code generation as is typically performed with LLVM. For translation to machine code, the abstractions are sufficiently lightweight to avoid the use of compute-intensive algorithms. Additionally, the IR contains information about the relational workload that enables efficient tuning of the machine code. The following listings show x86_64 assembly extended with additional tokens, which are shown in braces, e.g., {param1}.

3.1 Virtual registers
For example, the translated probe pipeline uses the virtual registers {s_a} and {s_b} to reach around the operators that are contained in the probe loop.

3.2 Function calls
Flounder IR abstracts function calls with a call pseudo-instruction. In the example, it calls ht_ins(...) with parameters param1 to paramN, and the return value is stored in {res}. A pointer to the function code is provided as an address constant via {ht_ins}. This pseudo-instruction is later replaced with an instruction sequence that realizes the calling convention.

3.3 Constant loads
Only constants up to a limited size can be used as immediate operands (imm) on current architectures. To use larger constants, they have to be placed in machine registers. The constant load abstraction in Flounder IR allows using such constants without restrictions. For example, a mov may read from the address {0x7fff5a8e39d8}+{offs} into the virtual register {attr}. During translation to machine assembly, the address constant will be placed in a machine register.

3.4 Transparent high-level constructs
Flounder IR supports high-level constructs for control flow that are transparently mapped to jumps and labels. Operator emitters construct loops with While(...), close(...), and isSmaller(...) as shown below.

During translation, the loop construct is expanded into a sequence that starts with the label loop_headN. A compare instruction then evaluates the loop condition, and a conditional jump transfers control to the loop_footN-label at the loop end if the condition evaluates to false. Otherwise, the loop body is executed, and afterwards the loop starts over by executing an unconditional jump, which redirects control flow to the loop head.
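The expansion just described can be sketched as a small emitter. This is an illustrative sketch, not Flounder's implementation: the mnemonics (cmp, jge, jmp) and the condition "{i} isSmaller 10" are assumed for the example; the loop_headN/loop_footN label names follow the text.

```python
class LoopEmitter:
    def __init__(self):
        self.code, self.n = [], 0

    def while_smaller(self, reg, bound):
        """Open a While(isSmaller(reg, bound)) construct."""
        n = self.n
        self.n += 1
        self.code.append(f"loop_head{n}:")
        self.code.append(f"  cmp {reg}, {bound}")
        self.code.append(f"  jge loop_foot{n}")   # exit if condition is false
        return n

    def close(self, n):
        """close(...): jump back to the head, then place the foot label."""
        self.code.append(f"  jmp loop_head{n}")
        self.code.append(f"loop_foot{n}:")

em = LoopEmitter()
n = em.while_smaller("{i}", 10)
em.code.append("  add {i}, 1")                    # loop body
em.close(n)
```

The resulting listing is the head label, the condition check with its conditional exit jump, the body, the back jump, and the foot label, exactly the shape described above.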
4 Machine code translation
x86_64
machine code, the abstractions that facilitated code generation in the previous step are now replaced with machine concepts. A key challenge here is to replace virtual registers with machine registers and to manage spill memory locations for cases of insufficient registers. Finding optimal register allocations is an NP-hard problem and even the computation of approximations is expensive [10]. In the context of JIT compilers, linear scan has been proposed as a faster algorithm [30] and was adopted by LLVM. However, linear scan register allocation is still relatively expensive due to live range computations and increasing numbers of registers.4.1 Register layout
We divide the 16 general purpose registers of the x86_64 architecture into three categories. We use twelve attribute registers attReg\(_1\), ..., attReg\(_{12}\) to carry attribute data and tuple ids. We use three temporary registers tmpReg\(_1\), tmpReg\(_2\), and tmpReg\(_3\), which are multi-purpose for accessing spill registers and constant loads. Lastly, we use the stack pointer rsp to store the stack offset. The stack base pointer rbp is repurposed for attribute data and not used for the stack.
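The three categories account for all 16 general purpose registers: 12 attRegs (including the repurposed rbp), 3 tmpRegs, and rsp for the stack offset. The partition below illustrates this arithmetic; the concrete choice of r13 to r15 as temporary registers is an assumption for the example, not ReSQL's documented assignment.

```python
# All 16 x86_64 general purpose registers.
GP_REGS = ["rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rbp", "rsp",
           "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]

stack_reg = ["rsp"]                 # holds the stack offset only
tmp_regs  = ["r13", "r14", "r15"]   # assumed tmpReg1..tmpReg3 choice
# everything else, including the repurposed rbp, carries attribute data
att_regs  = [r for r in GP_REGS if r not in stack_reg + tmp_regs]
```

Counting the partition confirms the layout from the text: twelve attribute registers remain once the stack pointer and the three temporaries are reserved.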
4.2 Translation algorithm
The translation algorithm lowers Flounder IR to x86_64 assembly in one sequential pass over the code. It replaces the Flounder abstractions with machine instructions, machine registers, and stack accesses. The algorithm is shown in Fig. 9. It uses the temporary registers tmpReg\(_1\) to tmpReg\(_3\) to hold values temporarily per instruction. Three registers are sufficient for this purpose, as three is the highest number of non-immediate operands per instruction. As an example, we look at the following instruction.
The instruction reads at offset {tid_os} from the memory address 0x7f... and stores the value in {r_a}. The address is too large for an immediate operand, and we assume for illustration purposes that both virtual registers {r_a} and {tid_os} are spilled.

The first step assigns tmpReg\(_1\) to the operand {r_a}. This is the only output operand of the instruction, and the translator emits a store to {r_a}'s spill slot on the stack. The second step assigns tmpReg\(_2\) to the operand {tid_os}. The translator emits a load to retrieve the value from its spill slot. The third step assigns tmpReg\(_3\) to the constant load of address 0x7f.... The translator emits a load for the constant. This results in the following machine code sequence, which includes the original mov instruction with replaced operands.
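The per-instruction translation of this example can be sketched as a small routine: spilled virtual registers and the large address constant are routed through tmpReg1 to tmpReg3, with loads emitted before the instruction and the store emitted after it. The spill offsets and the exact operand syntax are illustrative assumptions, not ReSQL's output.

```python
def translate(dst, src_addr, src_off, spill_slots):
    """Translate  mov dst, [src_addr + src_off]  with both vregs spilled."""
    pre, post = [], []
    # step 1: output operand dst -> tmpReg1; store to its spill slot after
    post.append(f"mov [rsp+{spill_slots[dst]}], tmpReg1")
    # step 2: spilled input operand -> tmpReg2; load from its spill slot first
    pre.append(f"mov tmpReg2, [rsp+{spill_slots[src_off]}]")
    # step 3: the address constant is too large for an imm -> tmpReg3
    pre.append(f"mov tmpReg3, {src_addr}")
    # the original instruction with all operands replaced
    core = "mov tmpReg1, [tmpReg3+tmpReg2]"
    return pre + [core] + post

code = translate("{r_a}", "0x7fff5a8e39d8", "{tid_os}",
                 {"{r_a}": 0, "{tid_os}": 8})
```

The result is a four-instruction sequence: two loads for the spilled input and the constant, the rewritten mov, and the store back to {r_a}'s spill slot.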
In the x86_64 calling convention, the calling function preserves up to 7 integer registers (caller-save registers) and passes up to 6 parameters in integer registers before using the stack for parameter passing [22]. As an example, we look at the translation of the call to ht_get(..) from a previous example (Fig. 4). The call pseudo-instruction uses virtual registers for the result {entry}, the function address {ht_get}, and the parameters {ht}, {r_a}, and {entry}. To derive the call-convention instruction sequence, the translator first replaces these operands with the already allocated machine operands (lines 1–3).
The translator then emits mov instructions that place the parameters into the parameter-passing registers of the calling convention, e.g., the address constant 0x25cac0 into rdi, the machine operand of {r_a} into rsi, and that of {entry} into rdx.

5 Getting more out of Flounder
5.1 Utilizing additional database knowledge
Characteristics of the relational workload, e.g., attributes that use NULL-logic or types with multi-register representations (e.g., 128 bit decimals), are known to the database. The translator has the opportunity to apply simpler or unified logic to handle such characteristics.

5.2 Memory optimizations via prefetching
The scan loop initializes {scanLoc} with the relation address {rel} and iterates over the relation's 16 byte tuples. The optimized variant additionally uses {scanBase} to iterate over the relation in steps of four tuples. After checking the loop condition, a prefetch for the succeeding iteration is issued. The unrolled loop body executes four iterations, which collectively read one cache line. By matching the loop granularity with the prefetching granularity, efficient prefetching of the scanned tuples is added.

5.3 Higher-level IRs
The scatter operation writes t[0] to the hash table key location k1 of the bucket new_pos. The scatter is executed conditionally, depending on the flag can_scatter. Translation to Flounder IR can be implemented as an operator emitter, similar to Sect. 2.1. The Voila operation translates to a short sequence of Flounder instructions: cmp and je instructions evaluate {can_scatter} to skip processing if necessary. Then a mov performs the actual write of {t0} to the hash bucket with base address {new_pos} and an exemplary offset +4.

6 Evaluation
For Flounder IR, we generate machine code with the AsmJit library [17] to avoid the overhead of running external assemblers, e.g., nasm. For LLVM IR, the machine code is generated by the LLVM library's JIT functionality. We use the O0 and O3 optimization levels for tradeoffs between compilation time and code quality. We compare against a system at version v0.5-222, which executes queries by JIT compiling via LLVM, and against DuckDB version v0.2.5, which executes queries with vector-at-a-time processing [7] for cache-efficiency. In its current development state, ReSQL only supports single-threaded execution. We configured all systems to run single-threaded for a fair comparison. Furthermore, ReSQL's query planner does not yet support sub-queries. Therefore, we only use benchmark queries that do not contain sub-queries.

6.1 Compilation times
With O0 optimization, LLVM has compilation times between 10 ms and 265 ms. With O3, compilation times range from 28 ms up to 657 ms. For both levels, the graphs show super-linear growth of compilation times with query complexity. Flounder shows lower compilation times that scale linearly between 0.3 ms and 10.8 ms. The highest factor of improvement is 24.6\(\times \) over llvm-O0 and 60.9\(\times \) over llvm-O3 (both for 100 join relations). For \({\mathbf {Q}}_{\pi }\), Flounder has very low compilation times, ranging from 0.1 ms (50 attributes) to 0.6 ms (500 attributes). This leads to factors of improvement up to 933\(\times \) over llvm-O3. We attribute this to the time LLVM spends on register allocation, due to the large number of virtual registers used for carrying attributes.
6.2 Machine code quality
6.3 Post-projection optimizations
Post-projection reads the attributes a\(_2\) to a\(_p\) only for tuples that pass the filter (1% of the relation) instead of performing a full scan. We analyze how the code generation strategies handle post-projection optimization by executing \({\mathbf {Q}}_{\pi }\) with \(p=\{10,50,100\}\). We use the llvm-based techniques, Flounder (naive), and Flounder (p.proj). The technique Flounder (p.proj) produces IR with explicit post-projection; the other techniques produce IR with full scans.
6.4 Prefetching optimization
The prefetch optimization reduces the number of stalled cycles by up to \(15\,\%\) for \({\mathbf {Q}}_{\sigma }\) and up to \(6\,\%\) for \({\mathbf {Q}}_{\bowtie }\). At the same time, execution times reduce by up to \(11\,\%\) (\({\mathbf {Q}}_{\bowtie }\)) and \(14\,\%\) (\({\mathbf {Q}}_{\sigma }\) with \(s=20\)). In particular, for \({\mathbf {Q}}_{\bowtie }\), calls of external functions (e.g., building the hash table) interfere with the prefetching. As a result, the stalls are reduced less compared to \({\mathbf {Q}}_{\sigma }\). This may be improved in the future by adapting the prefetching distance. The results also show an increase in compilation time for applying the optimization. However, even with prefetching optimizations, compilation times remain below 0.4 ms, which is sufficiently low to retain an overall benefit.