1 Introduction
- We describe DynQ, a language-agnostic query engine that can execute queries on collections of objects as well as on file data (e.g., JSON files) and other data sources without requiring any data schema (neither provided nor inferred). DynQ optimizes itself on the data types encountered during query execution.
- We describe DynQ's approach to query compilation, which relies on self-optimizing abstract syntax tree (AST) interpreters and dynamic speculative optimizations.
- We introduce reusable compiled queries, a novel approach to query execution that relies on an efficient and flexible cache of compiled queries, allowing the same compiled code to be reused for executing similar queries.
- We evaluate DynQ on workloads designed for both databases and programming languages. Our evaluation shows that DynQ's performance is comparable with a hand-optimized implementation of the same query and outperforms implementations based on built-in or third-party data-processing libraries in most of the workloads.
- We release an open-source prototype of DynQ.¹
2 Background
2.1 Language-integrated queries
The `where` and `select` clauses in a query comprehension are de-sugared into calls to the methods `Where` and `Select` defined in the `IEnumerable` interface. LINQ queries are not limited to collections of type `IEnumerable` or `IQueryable`; indeed, from a theoretical point of view, LINQ queries can be executed on any data type that exhibits the properties of a monad [20]. This great flexibility is obtained through so-called LINQ providers, i.e., data-source-specific implementations of the mentioned generic types. Relevant examples of LINQ providers are LINQ-to-XML (which queries XML documents) and LINQ-to-SQL, which converts query expressions into SQL queries and sends them to an external DBMS. Despite the benefits it provides, LINQ was explicitly designed targeting statically typed languages, and it is currently not available in dynamic languages. Our work overcomes this limitation.

2.2 Query execution models
In the traditional iterator model, each operator implements iteration methods (e.g., `MoveNext()` and `Current()` in C\(^\sharp \), or `hasNext()` and `next()` in Java), which introduce non-negligible overhead, since they are executed for each input row of each operator in the query plan. In the context of relational databases, the most relevant optimizations for removing such overhead are vectorization [7] and data-centric query compilation [44].

2.3 GraalVM and the Truffle framework
3 DynQ
- Language-independence and modularity: DynQ should be able to execute queries on any collection of objects from any language supported by GraalVM. Moreover, integrating new data sources and query operators in DynQ should impact only their respective components, i.e., their implementation should be language-independent.
- High performance: Query execution with DynQ from a dynamic language should be as efficient as a hand-optimized application written in the same language.
3.1 DynQ architecture
DynQ supports built-in SQL functions such as the `LIKE` function on strings and the `EXTRACT` function on dates. Moreover, DynQ seamlessly supports user-defined functions (UDFs), as it can directly inline code from any GraalVM language into its SQL execution code. In this way, UDFs from any of the GraalVM languages can be called during SQL evaluation with minimal runtime overhead. In particular, a UDF need not be written in the same language as the application that is using DynQ; e.g., it is possible to use DynQ from JavaScript, executing a query with a UDF written in R.

3.2 Query compilation in DynQ
The node types `ExpressionNode`, `ConsumerNode`, and `ExecutableNode` are shown in Fig. 2. When the query root operator is not a materializer (e.g., for queries composed of projections and predicates), DynQ adds a custom consumer which fills a list of rows, since DynQ always outputs an array data structure. On the other hand, when the root operator is a sort or an aggregation, DynQ returns the sorted (or aggregated) data, which is already a list of rows.

Each `ConsumerNode` implements the methods `consume(row)` and `getResult()`. As an example, if a query has a group-by operator (which is not the root operator in the query plan), its implementation of `consume(row)` updates the internal state (a hash map), and its implementation of `getResult()`, which is invoked by its source operator once all input tuples have been consumed, sends all tuples from the aggregated hash map to its destination (a `ConsumerNode`) by calling the `consume(row)` method for each aggregated row, and finally returns the value obtained by calling the `getResult()` method on its destination consumer. Since push engines do not allow terminating the source iteration, i.e., an operator cannot control when data should no longer be produced by its source operator, DynQ implements early exits for the limit query operator by throwing a special `EndOfExecution` exception.
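The push-based consumer protocol and the exception-based early exit can be sketched in plain JavaScript. This is a simplified illustration of the protocol described above, not DynQ's actual (Java/Truffle) implementation; the class and method names follow the text.

```javascript
// Minimal sketch of a push-based consumer chain: each consumer implements
// consume(row) and getResult(); a limit operator terminates the source
// iteration early by throwing EndOfExecution.
class EndOfExecution extends Error {}

class ToList {
  constructor() { this.rows = []; }
  consume(row) { this.rows.push(row); }
  getResult() { return this.rows; }
}

class Limit {
  constructor(n, dest) { this.n = n; this.dest = dest; }
  consume(row) {
    if (this.n-- <= 0) throw new EndOfExecution();
    this.dest.consume(row);
  }
  getResult() { return this.dest.getResult(); }
}

class Filter {
  constructor(pred, dest) { this.pred = pred; this.dest = dest; }
  consume(row) { if (this.pred(row)) this.dest.consume(row); }
  getResult() { return this.dest.getResult(); }
}

// The table scan pushes each row into the consumer chain and catches
// EndOfExecution to implement the early exit for the limit operator.
function tableScan(rows, consumer) {
  try {
    for (const row of rows) consumer.consume(row);
  } catch (e) {
    if (!(e instanceof EndOfExecution)) throw e;
  }
  return consumer.getResult();
}
```

For example, `tableScan([1, 2, 3, 4, 5, 6], new Filter(x => x % 2 === 0, new Limit(2, new ToList())))` stops scanning as soon as the first two even numbers have been produced.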
The `LessThanNode` node leverages Truffle specializations for implementing the less-than operation. The `LessThanNode` implementation shown in Fig. 5 presents only the specializations for the Int and Date types, because those types are the ones used in the example. The actual implementation contains specializations for all types supported in DynQ as well as their possible combinations (e.g., Int/Double and Double/Int). In particular, DynQ specializations with mixed types respect the implicit type-conversion (i.e., type-cast) semantics commonly integrated in a query planner; in DynQ, however, the detection of such casts must take place during data processing (instead of during query planning), since at query-planning time types are not known in DynQ. Consider the method `execute(Object row)` defined in the class `LessThanNode`. This method first executes the left and right children expression nodes (i.e., the property reads in the example query). Then, the call to `executeSpecialized` (internally) performs a type check on the two arguments (i.e., `fst` and `snd`). If both values have type `int`, the specialization `execute(int, int)` is executed; if they are both dates, the method `execute(LocalDate, LocalDate)` is executed; otherwise, the current tuple is discarded. Note that, although our current implementation is permissive, i.e., it does not stop the query execution by throwing an exception when a malformed row is encountered, implementing different error-handling strategies would be trivial.
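The effect of such a specializing node can be mimicked in plain JavaScript. The sketch below is illustrative only: Truffle derives and invalidates these specialization states automatically from annotated Java methods, whereas here the observed type pair is cached by hand.

```javascript
// Sketch of a self-specializing less-than node. On first execution it records
// the observed operand types; later rows take the specialized fast path, and
// rows with other types are discarded (mirroring the permissive error
// handling described in the text).
function makeLessThanNode() {
  let state = "uninitialized"; // becomes "ints" or "dates" after first execution
  return function execute(fst, snd) {
    if (state === "uninitialized") {
      if (typeof fst === "number" && typeof snd === "number") state = "ints";
      else if (fst instanceof Date && snd instanceof Date) state = "dates";
    }
    if (state === "ints" && typeof fst === "number" && typeof snd === "number") {
      return fst < snd;                       // specialized Int/Int comparison
    }
    if (state === "dates" && fst instanceof Date && snd instanceof Date) {
      return fst.getTime() < snd.getTime();   // specialized Date/Date comparison
    }
    return null;                              // unexpected types: discard row
  };
}
```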
On a different data source, the `TableScan` executable node and the `ReadMember` expression nodes would specialize in different ways, depending on the runtime types. The flexible design of DynQ allows reusing the very same query-operator nodes for executing queries on different data structures, like a JavaScript array of objects or an R data frame. Thanks to this design, we achieve all three goals listed at the beginning of this section. In particular, the extensibility and modularity of our design allow adding new data sources (e.g., a data structure in a dynamic language or an external source like a JSON file) by integrating only the expression nodes which take care of accessing data from such a data source, without requiring any modification to the query-operator nodes.

Initially, AST nodes are executed in the interpreter (like the `LessThanNode` in the previous example). Then, once the runtime detects that some nodes are frequently executed (e.g., the main loop in `TableScanNode`), it initiates machine-code generation. Once the runtime has collected type information for the rows which have been processed in the interpreter, it speculatively generates machine code assuming that the subsequent rows will have the same types. If such speculative assumptions get invalidated (e.g., because a subsequent row has an unexpected type), the compiled code gets invalidated and the execution falls back to interpreted mode. The runtime can then update the collected type information and later re-compile the nodes to machine code accordingly. It is important to note that, even if triggering recompilation has a cost, specializations stabilize quickly [71], typically incurring only minor overhead. By leveraging a state-of-the-art dynamic compiler like Graal, DynQ can selectively compile single components of the query's physical plan. In particular, each table-scan executable node can be selectively compiled to self-contained machine code. Thanks to this approach, a query does not need to be fully compiled to machine code to achieve high performance; e.g., executing a join operator could lead to the evaluation of one child node in the interpreter (if it has few elements) and the other child in compiled machine code.

Figure 6 shows the code generated for the example query, assuming that the `x` and `y` properties have either type Int or Date. As the figure shows, all the calls to the interface methods are aggressively inlined by the compiler. The operations at the beginning of the `while` loop that interact with the host dynamic language (i.e., reading the current array element and its properties `x` and `y`) are inlined by the compiler as well. Moreover, the predicate node is compiled into two `if` statements that check whether, in the current row, the fields `x` and `y` have one of the expected types. If this is not the case, in general, the generated code would be invalidated as described above; in this specific example, however, since there is no other specialization in the less-than node, the current row is simply discarded and the generated code does not need to be invalidated.
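The shape of such generated code can be approximated in JavaScript. This is an illustrative rendition only: the actual output of Graal is machine code, and the counting sink here is an assumption made to keep the sketch self-contained.

```javascript
// Approximate shape of the code generated for the example table scan:
// one tight loop with inlined property reads and inline type checks,
// counting the rows where x < y; rows with unexpected types are discarded.
function countLessThan(rows) {
  let count = 0;
  for (let i = 0; i < rows.length; i++) {
    const x = rows[i].x, y = rows[i].y;        // inlined ReadMember(x)/ReadMember(y)
    if (typeof x === "number" && typeof y === "number") {
      if (x < y) count++;                      // Int/Int specialization
    } else if (x instanceof Date && y instanceof Date) {
      if (x.getTime() < y.getTime()) count++;  // Date/Date specialization
    }
    // any other type combination: row discarded
  }
  return count;
}
```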
3.3 DynQ providers
3.4 Language-specific type conversions
Date values are exposed as `LocalDate` instances once shared among different languages, but the internal representation in a specific language may be different; e.g., in JavaScript, dates are represented as long values, i.e., the number of milliseconds since the epoch (January 1, 1970, UTC) [13]. As an example, consider the following simple query:

A naive implementation would create a `LocalDate` instance for the constant date (2000-01-01); then, during predicate evaluation, for each row:

- It would check that the current row contains the field `X` and that it is actually a date instance (this step cannot be avoided in the context of dynamic languages).
- It would convert the JavaScript date into a `LocalDate` instance.
- Finally, it would compare the converted `LocalDate` instance with the constant one (2000-01-01).
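The saving from avoiding the per-row conversion can be illustrated in JavaScript: the constant is converted once into the host representation (epoch milliseconds) and each row's date is then compared in its native form. The field name `X` follows the example; the function name is hypothetical.

```javascript
// Sketch: predicate evaluation on JavaScript dates without per-row conversion.
// The constant date is converted once to epoch milliseconds; each row's date
// is compared directly in its native millisecond representation.
const CONSTANT = Date.UTC(2000, 0, 1); // 2000-01-01, converted once

function afterConstant(rows) {
  const out = [];
  for (const row of rows) {
    const v = row.X;
    if (v instanceof Date) {                      // type check: unavoidable without a schema
      if (v.getTime() > CONSTANT) out.push(row);  // native millisecond comparison
    }
  }
  return out;
}
```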
3.5 Fluent APIs in DynQ
Finally, `evenSquares.ToArray()` is called to materialize the query result into an array. From now on, we refer to this syntax as fluent APIs. As an example of fluent-API usage with DynQ, Fig. 8 shows how the de-sugared LINQ query in the example above can be executed with DynQ.
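The idea behind such a fluent pipeline can be sketched with a tiny lazy builder in plain JavaScript. The names `filter`, `map`, and `toArray` are illustrative, not DynQ's actual API; operators are recorded lazily and `toArray()` materializes the result, mirroring `ToArray()` in LINQ.

```javascript
// Minimal lazy fluent pipeline builder (illustrative).
class Pipeline {
  constructor(source, ops = []) { this.source = source; this.ops = ops; }
  filter(p) { return new Pipeline(this.source, [...this.ops, r => (p(r) ? [r] : [])]); }
  map(f)    { return new Pipeline(this.source, [...this.ops, r => [f(r)]]); }
  toArray() {
    // Materializer: pushes each source row through the recorded operators.
    const out = [];
    for (const row of this.source) {
      let rows = [row];
      for (const op of this.ops) rows = rows.flatMap(op);
      out.push(...rows);
    }
    return out;
  }
}

// Even squares, in the fluent style of the text's example:
const evenSquares = new Pipeline([1, 2, 3, 4, 5])
  .filter(x => x % 2 === 0)
  .map(x => x * x);
```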
4 Caching compiled queries
| Approach | Calls de-virtualization | Reachability of a stable state |
|---|---|---|
| per-query compilation | \(\checkmark \) guaranteed (all calls) | \(\times \) never (by design) |
| hot-code compilation | \(\times \) best-effort (all calls) | \(\checkmark \) most of the applications |
| reusable compiled queries | \(\checkmark \) guaranteed (subset of calls), \(\times \) best-effort (remaining calls) | \(\checkmark \) most of the applications |
4.1 Explicit parametricity
Prepared statements in DynQ are implemented as instances of `ExecutableNode` that accept parameters. In particular, when a query is prepared, DynQ generates an equivalent AST as discussed in the previous sections. During the AST generation, when DynQ encounters a query variable, i.e., the question-mark symbol, it creates an expression node which acts as a placeholder for a value to be bound at query execution time. During the execution of a prepared statement, DynQ binds the placeholders to their values, which are retrieved from the local scope of the currently executing query, i.e., its stack frame. Thanks to this approach, once the AST generated from a prepared statement is compiled by the JIT compiler, all subsequent invocations of the prepared statement execute the same compiled code. Note that, similarly to the case of executing a query without bound parameters, prepared statements may be subject to re-compilation, too. However, since prepared statements are designed to be executed multiple times, re-compilation may take place between different executions. In particular, re-compilation may take place if the types of the prepared statement's parameters change among different invocations of the same prepared statement, as DynQ would need to generate new machine code specialized for the different types.

Consider again the example query, where the predicate `x < y` in Fig. 3 is now `x < ?`. Figure 10 shows the AST generated from the prepared statement in Fig. 9. As expected, the AST is very similar to the one shown in Fig. 4; the only difference is the node `PlaceHolder$0`, i.e., the placeholder for the prepared-statement variable, which replaces the node `ReadMember(y)`. The compiled code for the AST depicted in Fig. 10 is also very similar to the one shown in Fig. 6, with the only difference that the generated function now takes a parameter to be compared with the property `x` of each row.

DynQ's fluent API offers a parameter marker, `DynQ.par`, as well as a special operator: `prepare`. The parameter marker `DynQ.par` acts like the question-mark symbol in a prepared statement. However, in contrast to prepared statements, parameters are not limited to placeholder replacements for raw values, since placeholder nodes in the fluent API can represent any UDF. When a UDF is provided as an argument to an operator appended to a pipeline-builder, e.g., `.map(x => x*x)`, as with per-query compilation, DynQ forces the inlining of the UDF into the query code, de-virtualizing the call to the UDF. On the other hand, when the marker `DynQ.par` is used to indicate that a parameter will be provided at query execution time, DynQ introduces a virtual call into the generated code pointing to the placeholder location, where DynQ will place the reference to the UDF provided at query execution. Consider again the DynQ fluent-API example in Fig. 8, which selects the squares of the even numbers in a given array. Figure 11 shows an example of parametricity, defining a similar query which is parametric in the predicate expression. Such a parametric fluent API can later be invoked by passing (as a parameter) an arbitrary UDF as the predicate expression. Indeed, as the figure shows, the same compiled query can be used to evaluate the squares of the even numbers as well as the squares of the numbers which pass any given predicate, e.g., the odd numbers.
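The distinction between UDFs fixed at preparation time and UDFs bound at execution time can be emulated in plain JavaScript. The marker name `PAR` plays the role of `DynQ.par` from the text; everything else in the sketch (function names, the closure-based "compilation") is a hypothetical emulation, not DynQ's implementation.

```javascript
// Emulation of a parametric fluent query: operators given as literals are
// fixed when the query is prepared, while PAR marks a slot that is bound at
// each execution (playing the role of DynQ.par).
const PAR = Symbol("parameter");

function prepareFilterMap(pred, mapper) {
  // Build the reusable pipeline once; placeholder slots are filled from the
  // arguments supplied at each execution.
  return function execute(xs, ...params) {
    let i = 0;
    const p = pred === PAR ? params[i++] : pred;     // predicate slot
    const m = mapper === PAR ? params[i++] : mapper; // projection slot
    return xs.filter(p).map(m);
  };
}

// Prepared once: the predicate is a parameter, the projection is fixed.
const squaresOf = prepareFilterMap(PAR, x => x * x);
```

The same prepared closure then computes the squares of the even numbers or of the odd numbers, depending only on the predicate passed at execution time.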
4.2 Reusable compiled queries
With reusable compiled queries, DynQ forces the inlining only of the calls to the two methods of `ConsumerNode`, i.e., `consume(row)` and `getResult()`. We avoid forcing the inlining of calls to nodes of type `Expression`, leaving the inlining decisions for expression nodes to the underlying JIT compiler (i.e., Graal), as is done for all methods during the execution of an application on a VM with hot-code compilation. Consider again the even-squares example query in Fig. 8; the sequence of operators which composes this query starts with a source table-scan operator, followed by a predicate, a projection, and finally a sink `toArray` operator which materializes the rows into an array. The representation of this query, partially specialized by de-virtualizing the calls to nodes of type `ConsumerNode`, is equivalent to the following parametric fluent API.
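A plain-JavaScript emulation of such a partially specialized pipeline might look as follows. This is an illustrative sketch, not DynQ's generated code: the fixed, reusable part is the operator chain (scan, filter, map, toArray), while the source, the predicate, and the projection are all execution-time parameters.

```javascript
// Emulation of the partially specialized even-squares pipeline: the operator
// sequence is fixed and fully de-virtualized, while the source, predicate,
// and projection are supplied at each execution.
function partiallySpecialized(source, pred, proj) {
  const out = [];               // sink: toArray consumer
  for (const row of source) {   // fixed scan -> filter -> map -> toArray chain
    if (pred(row)) out.push(proj(row));
  }
  return out;
}
```

Invoked with `partiallySpecialized(xs, x => x % 2 == 0, x => x * x)` it computes the even squares; the very same code reused with a different predicate and projection computes, e.g., the cubes of the odd numbers.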
For example, the cubes of the odd numbers in `xs` can be computed by invoking `partiallySpecialized`, passing as parameters `xs`, `x => x % 2 == 1`, and `x => x * x * x`.

Consider a query built with the fluent API starting from `root.query = DynQ.prepare()`. When the query is executed, DynQ looks up the tree node corresponding to its sequence of operators. If that node does not yet contain an executable node (i.e., its cached reference is still `null`), DynQ generates a new `ExecutableNode` through the parametric fluent API instance in the tree node and caches it in the tree node itself; otherwise, DynQ reuses the already generated (i.e., cached) `ExecutableNode`. Note that the generated AST does not contain any expression nodes but parameters, since all actual parameters provided by the developer are internally stored in the pipeline-builder instances and replaced with `DynQ.par` within the generated executable node, making that node reusable for any other query composed of the same sequence of operators. Once the `ExecutableNode` has been retrieved (either cached or freshly created), it is automatically invoked by passing as parameters the array \(P_1.actualParameters\) (line 54 in Fig. 12). It is important to note that reusable compiled queries do not prevent additional speculative compiler optimizations; e.g., the compiler could decide to speculatively inline a UDF in a query, as in hot-code compilation. Reusable compiled queries can be seen as an optimized, data-processing-specific code cache for fluent APIs, which is able to detect similar queries and reuse their compiled representation. The remaining virtual calls (e.g., a UDF passed to a `filter` operation) can be de-virtualized by the VM once the generated code for DynQ's fluent API is inlined and optimized in the context of the caller.

5 Evaluation
- Queries consisting of selection and aggregation (without group-by), producing a single row (i.e., queries 1, 2, 3).
- Queries consisting of selection and projection, which return a list of rows (i.e., query 4), possibly with a limit operator (i.e., query 6) or with both sort and limit (i.e., query 5).
- A query consisting of selection and join, followed by an aggregation operator (i.e., query 7).
5.1 Evaluation plan
On the R workloads, we compare DynQ against the `data.table` API, DuckDB [53], and MonetDB [23]. In this setting, we import the TPC-H tables into R data frames. Since TPC-H is based on a strict (relational) schema, and the data is imported into R data frames, which are a typed data structure, the evaluation on R does not highlight DynQ's ability to efficiently access data with unknown schema. Indeed, for all the experiments in this setting, DynQ uses the schema information from the data frames. However, DynQ currently uses the schema only for the data-access operations; all the query-operator nodes as well as the other expression nodes share the same implementations as in the case of unknown schema, as described in Sect. 3.4. The main goal of this evaluation is to show that, on relational database workloads, DynQ's flexibility in accessing data formats which are not directly managed by the query engine does not impair performance, in contrast to other data-processing systems.

5.2 R benchmarks
We compare DynQ with the `data.table` API, which is arguably the most efficient library for processing R data frames. The benchmark results are depicted in Fig. 14. As the figure shows, DynQ is slower than the `data.table` API only on MQ7 and outperforms it on all other queries, by speedup factors ranging from 1.16x (MQ4) to 27.8x (MQ5). The speedup on MQ6 against `data.table` (i.e., 826x) is due to the fact that DynQ chains query operators and stops the computation once it finds the first 1000 elements that satisfy the predicate (i.e., the limit operator). On MQ6, DynQ performs comparably with DuckDB, with speedup factors of about 1.18x against DuckDB(df) and 0.63x against DuckDB(preload), and a query execution time of about 1 ms, showing the effectiveness of our exception-based approach for implementing early exits for the LIMIT operator. We consider such a low query execution time a great achievement for DynQ, since existing query engines based on compilation commonly suffer from a latency overhead due to query compilation. DynQ outperforms DuckDB(df) on all other queries as well, with speedup factors ranging from 4.38x (MQ7) to 48.14x (MQ1). Those speedups against DuckDB(df) are explained by the fact that the micro-benchmark queries are simple and mostly dominated by table scans: DuckDB(df) requires table-scan operations to convert data on-the-fly from data frames into the DuckDB physical data representation, which introduces high overhead. On the other hand, DynQ can execute queries on R data frames in-situ, i.e., without any conversion. Indeed, DynQ's performance is closer to DuckDB(preload), which significantly outperforms DuckDB(df), showing that DynQ's flexibility in accessing data in different formats does not impair performance. In particular, DynQ is slower than DuckDB(preload) only on queries MQ6 (factor 0.63x) and MQ7 (factor 0.65x), and outperforms DuckDB(preload) on all other queries, with speedup factors ranging from 1.11x (MQ1) to 1.77x (MQ3).
5.3 JavaScript benchmarks
5.3.1 Evaluation on AfterBurner
5.3.2 Evaluation on object arrays
Like the `data.table` API in R, Lodash does not chain the filter with the limit operation, unlike DynQ. Moreover, DynQ's performance is comparable with the hand-written implementations on most of the queries. In particular, DynQ is slower than the hand-written implementations only on MQ2 (0.91x), and faster on MQ6 (2.46x) and MQ7 (2.04x). There are multiple reasons why DynQ is able to outperform the hand-written queries. First, the JavaScript semantics may enforce additional operations which are not required in data processing; as an example, JavaScript's Map performs hashing by converting each value into a string representation. Moreover, during the execution of hand-written queries, the JavaScript engine needs to perform more runtime checks than DynQ. Besides performance, the implementations using DynQ are the most concise ones. In particular, the hand-written implementations of the micro-benchmark queries count 160 lines of code (LOC), the Lodash implementations count 58 LOC, and the DynQ implementations count 40 LOC.
- `findByState`: finds the first location which matches a given state name.
- `findByCityAndState`: finds all the locations which match a given city name and state name.
- `zipLookup`: finds the first location which matches a given ZIP code.
- `gpsLookup`: finds the closest location to a given point (by latitude and longitude).
We do not evaluate Lodash on the `gpsLookup` API, since the original version is already hand-optimized and does not use any third-party data-processing library. For evaluating DynQ on the `gpsLookup` API, we use two versions: one version (DynQ (JS UDF)) uses the JavaScript module `haversine` as a UDF for calculating the distance between two points, while the other version (DynQ (Java UDF)) uses a Java UDF instead of the JavaScript one. We manually implemented the Java UDF by carefully replicating the JavaScript version, such that the executed algorithm is exactly the same.
Since those APIs take arguments (e.g., `findByState` accepts the name of a state), we implemented them in DynQ leveraging the prepared statements introduced in Sect. 4.1. As an example, the following is the DynQ implementation of the `findByState` API.
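The prepared-lookup pattern behind such an implementation can be sketched in plain JavaScript. The sketch below is an emulation under assumed names and data, not DynQ's actual listing: the pipeline is built once, and only the state name is bound per invocation, which is what allows the same compiled code to serve every call.

```javascript
// Emulation of a prepared findByState lookup: the "prepared statement" is a
// closure built once over the data set; the state name is a parameter bound
// at each execution, so every invocation reuses the same code.
function prepareFindByState(locations) {
  return (state) => locations.find(loc => loc.state === state);
}

// Hypothetical data set, for illustration only.
const locations = [
  { city: "Zurich", state: "ZH", zip: "8000" },
  { city: "Lugano", state: "TI", zip: "6900" },
];
const findByState = prepareFindByState(locations);
```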
The last set of experiments concerns the `gpsLookup` API. Since for the latter we use DynQ in two different ways (i.e., implementing the UDF in JavaScript and in Java), Fig. 23 shows (above the bars of those two implementations) their respective speedups against the original implementation. As the figures show, DynQ outperforms both Lodash and the hand-optimized implementations on all APIs. Moreover, the evaluation of the `gpsLookup` API shows that evaluating a UDF with DynQ does not introduce any overhead when the UDF is implemented in the host dynamic language (i.e., JavaScript). This is expected, since, as discussed in Sect. 3, GraalVM can inline the machine code generated from the JavaScript UDF within the query execution code. Moreover, when the UDF is implemented in Java, performance improves: we measure a speedup factor of 1.69x. This is also expected, since executing the JavaScript UDF requires more type checks than executing the Java one. Our evaluation on existing codebases shows that, besides data analytics, DynQ is also a promising library for server-side Node.js applications that perform in-memory data processing.
5.3.3 Reusable compiled queries
The benchmark makes intensive use of the `flatMap` operator, and Lodash implements that operator by materializing intermediate results. Our JavaScript implementations of the query operators are similar to the ones in Lodash, but the `flatMap` operator is implemented without materializing intermediate results, as in DynQ, which explains the performance improvement of our JavaScript implementations w.r.t. Lodash. Although the JavaScript implementation of the query operators is conceptually very similar to the one in DynQ, leveraging explicit parametricity gives DynQ a speedup of 1.57x against the JavaScript implementation. Reusable compiled queries also outperform the JavaScript implementation, since, as discussed in Sect. 4.2, reusable compiled queries can guarantee that the sequence of operators composing a pipeline is fully de-virtualized. Finally, we note that explicit parametricity leads to a speedup of 1.15x in comparison with reusable compiled queries, a higher speedup than the one observed on Scrabble. This is expected, since Mnemonics is a function which, for the benchmark input, performs 338 recursive calls in a single run of the benchmark; the overhead of reusable compiled queries w.r.t. explicit parametricity is thus amplified in comparison to Scrabble, since each recursive call involves a query execution.