Das Kapitel beginnt mit einer Skizze der üblichen Szenarien, in denen eine Abfrageübersetzung zwischen verschiedenen SQL-Engines erforderlich ist, wie etwa Datenbankmigration und Prototyping. Anschließend vertieft er sich in die semantischen Diskrepanzen im Zusammenhang mit dem Tippverhalten, die zu unerwarteten Ergebnissen oder Fehlern während der Ausführung der Abfrage führen können. Der Kern des Kapitels stellt ein formales Rahmenwerk vor, das so genannte "Framework", das darauf ausgelegt ist, sowohl gängige als auch abweichende Verhaltensweisen innerhalb von SQL-Datenbankmaschinen zu modellieren. Dieses Rahmenwerk umfasst ein minimales Kernfragment der relationalen Algebra, das essentielle Operatoren und Typen-Casts unterstützt, und ist flexibel genug, um verschiedene SQL-Engines unterzubringen. Das Kapitel behandelt auch die vier aufeinander folgenden Phasen des Rahmenwerks: Übersetzung, Schriftprüfung, Gusseinfügung und Bewertung, die jeweils eine entscheidende Rolle beim Verständnis und bei der Behebung von Schreibfehlern spielen. Darüber hinaus bietet das Kapitel empirische Validierung durch die Entwicklung mehrerer Datenbank-Interpreter und deren Tests mit realen Engines, was die Auswirkungen von Abfrageoptimierungen auf den Bewertungsprozess beleuchtet. Die detaillierten Analysen und praktischen Beispiele machen dieses Kapitel zu einer unschätzbaren Ressource für jeden, der die Komplexität von Typkonvertierungen in SQL-Engines bewältigen möchte.
KI-Generiert
Diese Zusammenfassung des Fachinhalts wurde mit Hilfe von KI generiert.
Abstract
Practical SQL engines differ in subtle ways in their handling of typing constraints and implicit type casts. These issues, usually not considered in formal accounts of SQL, directly affect the portability of queries between engines. To understand this problem, we present a formal typing semantics for SQL, named \(\textsf{TRAF} \), that explicitly captures both static and dynamic type behavior. The system \(\textsf{TRAF} \) is expressed in terms of abstract operators that provide the necessary leeway to precisely model different SQL engines (PostgreSQL, MS SQL Server, MySQL, SQLite, and Oracle).
We show that this formalism provides formal guarantees regarding the handling of types. We provide practical conditions on engines to prove type safety and soundness of queries. In this regard, \(\textsf{TRAF} \) can serve as precise documentation of typing in existing engines and potentially guide their evolution, as well as provide a formal basis to study type-aware query optimizations, and design provably-correct query translators. Additionally, we test the adequacy of the formalism, implementing \(\textsf{TRAF} \) in Python for these five engines, and tested them with thousands of randomly-generated queries.
Query translation between different SQL engines is a common practice arising in different scenarios, like database migration (to reduce costs, maintenance, changes in software, etc.) [6, 20, 22] and prototyping (code in lightweight databases like SQLlite and then port to a more robust database). Today there are many tools addressing this task [1, 18].
Translation between SQL engines could bring many surprises. One that particularly captures attention is the semantic discrepancy between SQL engines related to typing behavior, the problem we study in this paper. This problem is mainly due to differences in datatypes, type checking, when to perform the checks, and explicit and implicit type casts that may or may not be performed by the engines. This poses substantial challenges for developers and database administrators. Existing migration tools, whether paid or open source, often fall short of addressing these differences, tending to prioritize syntax over behavioral disparities.1
Anzeige
For illustration purposes, consider a table . In the query , the engine \(\textsf{PSQL}\) reports a static error and \(\textsf{Oracle}\) a runtime error (in both engines, addition is not defined for strings); \(\textsf{MSSQL}\) interprets the operation as string concatenation, yielding ; and \(\textsf{MySQL}\) and \(\textsf{SQLite}\) yield . On the other hand, the query shows that \(\textsf{MySQL}\) and \(\textsf{SQLite}\) do not always exhibit identical behavior: the first yields , while the other an empty result. In Table 1 we present further (minimal) examples of this wide difference in behavior, which are explained in detail in Section 2.
As these examples show, database engines have different treatment of types, following different design models, like some being statically typed while others embrace dynamic typing, and different approaches to overloading basic arithmetic and comparison operations.
Understanding and addressing these anomalies is not a simple task. A first issue is that SQL standards do not cover many issues related to typing or leave the interpretation rather open.2 In addition, many of the design decisions of the engines are hidden under optimization mechanisms that either are not public or complex to find in the code. Furthermore, the problem has not been addressed by the research literature. Although there is solid work on the formalization of the semantics of SQL [19, 31], they assume that all comparisons and operations apply to the right types. Typing in SQL have been explored [3, 13, 24, 26, 32]. Nonetheless, the problem of type constraints and casts potentially raising errors at runtime, which is addressed in this paper, has not been dealt with before.
In this paper we address the problem of discrepant type related behavior by proposing a general formal framework, called \(\textsf{TRAF} \), that models both common behavior and the intricate behavioral discrepancies across SQL database engines. \(\textsf{TRAF} \) was designed to explicitly capture the semantics of types, both static and dynamic, of a core fragment of relational algebra. The selected minimal core (already presented in other works like [8, 19, 27]) is designed to include the minimal features and operators that generate the indicated anomalies, as well as being flexible enough to model different SQL engines. It comprises booleans, numbers, selections, cross-products, nested queries within clauses, and set operations. We also add support for arithmetic operators and type casts. Although features such as nulls, aggregation, or , are not supported within this core, it serves to illustrate the primary distinctions, leaving the extension to these additional features as potential future work.
Anzeige
We study two kinds of different behavior among engines: (1) different results due to type conversions, and (2) different behavior related to type errors. Regarding the latter, we identify two kinds of type errors: static errors, which happen during compilation/before running the query; and runtime errors that occur during the execution of the query. We say that two engines differ in behavior if, for a given query, the results are different (including different kinds of errors).
\(\textsf{TRAF} \) involves four sequential phases as illustrated in Figure 1: translation, typechecking, cast insertion, and evaluation. In the translation step (which is standard and can be found in the extended version of the paper), an SQL query is (1) analyzed to rule out syntactically invalid queries such as and (2) transformed to a \(\textsf{TRAF} \) query, that is, essentially a typed relational algebra query. Next, in the typechecking phase, a typechecker validates the query before evaluation (Section 3.3).3 This typechecker is responsible for identifying mismatches between types and the number of columns of subqueries in set operations, and to ensure proper access to column names in scope. More importantly, it validates the appropriate usage of both implicit and explicit casts, rejecting operations that may result in casts known to always fail during evaluation. If the typechecker rejects the query, the process terminates and reports a static type error to the user. However, if the query is deemed well-typed, it proceeds to the third phase.
Fig. 1.
Overview of \(\textsf{TRAF} \). Some abstract operators in the typechecking, cast insertion, and evaluation phases must be instantiated to implement a particular engine. We provide sample instantiations for \(\textsf{PSQL}\), \(\textsf{MSSQL}\), \(\textsf{Oracle}\), \(\textsf{MySQL}\) and \(\textsf{SQLite}\).
The cast insertion phase (Section 3.4) transforms a \(\textsf{TRAF} \) query by making all implicit type casts explicit. The purpose of this phase is to circumvent the complexities that would arise from handling implicit casts in the evaluation phase, thereby simplifying the dynamic semantics of \(\textsf{TRAF} \). The evaluation phase (Section 3.5) interprets the transformed \(\textsf{TRAF} \) query. For this purpose, we present a monadic evaluator expressed using denotational semantics, cleanly encompassing the management of explicit type conversions and runtime errors. Should a type error arise during evaluation, the evaluation process is halted, and a runtime type error is reported. In the absence of runtime type errors, the outcome of evaluating a query is a table.
\(\textsf{TRAF} \) is designed to be flexible enough to model different real-world engines (\(\textsf{PSQL}\), \(\textsf{MSSQL}\), \(\textsf{Oracle}\), \(\textsf{MySQL}\) and \(\textsf{SQLite}\)) in order to facilitate understanding of different behaviors and thus enable informed decisions when translating queries. Specifically, typechecking, cast insertion, and evaluation are parameterized in terms of abstract operators that need to be instantiated (Section 4).
The formal model is coherent and comprehensive, making it possible to precisely formulate and prove properties satisfied by all or some engines (Section 5). First, we prove a type safety result that ensures that well typed queries either reduce to a table or raise a (controlled) type error. In other words, evaluation of well-typed queries does not get stuck. Second, we enhance this result proving that the resulting table can be typed to the same type of the query. Third, the instantiation of the model to \(\textsf{PSQL}\), \(\textsf{MySQL}\) and \(\textsf{SQLite}\) satisfies a theorem stating that if the programmer does not use explicit casts in well-typed queries, then the queries will evaluate without errors. Fourth, regarding cast insertion, independently of the engine (the proofs are parameterized by light constraints on abstract operators), translated queries preserve types, and the translation is unique.
Finally, to validate the adequacy of our formal framework, and following other approaches that also validate semantics by testing them against real-world implementations [19, 29, 30], we developed multiple database interpreters, and tested them with thousands of randomly-generated queries, comparing their results with those obtained from actual database engines. Additionally, we shed light on the impact and challenges of query optimizations on the evaluation process.
In summary, our contributions includes the identificacion of practical semantic discrepancies due to typing between SQL engines; the description of \(\textsf{TRAF} \), a formal framework to reason about types both statically and dynamically, which can model different SQL engines; the metatheory of \(\textsf{TRAF} \), indicating precise constraints on abstract operators required for properties to hold; and an empirical validation of our formal model, by building several interpreters tested against real engines. Additional material, including proofs and full definitions, can be found in the extended version. Also, we made available a prototype implementation as supplementary material.
The rest of the paper is organized as follows. Section 2 explains the discrepancies illustrated in Table 1. Section 3 presents the \(\textsf{TRAF} \) formal framework. Section 4 describes the instantiation of \(\textsf{TRAF} \) to practical SQL engines. Section 5 presents and proves formal properties of the model and its instantiations. Section 6 summarizes the experimental validation. In Section 7 we discuss related work, and Section 8 presents brief conclusions.
2 Typing semantic discrepancies
Table 1.
Examples of discrepant behaviors of different database engines, considering the table . The queries are purposely chosen to exhibit minimal cases of typing issues. For simplicity we show only one element of the result. We use to denote a static error (before execution), and only one element of the result. We to denote a runtime error.
Dummy
In this section we explain the source of discrepancies of Table 1, which we use as a starting point and develop (and justify) \(\textsf{TRAF} \). For clarity, we categorize the examples of discrepancies into three groups: arithmetic, boolean, and set operations.
2.1 Arithmetic operations
For the first set of examples we focus on the expression being selected rather than the tables or the conditions.
E1andE2. To begin, we present two examples that behave uniformly across the engines. For simplicity, we say that the result is 2.1, to denote a bag of uniform elements \(\{2.1, 2.1, 2.1\}\).
E3. This is the first example that illustrates a difference in behavior. \(\textsf{PSQL}\) and \(\textsf{MSSQL}\) throw a type error, whereas the other engines return 2.1. The reason for the error is that in \(\textsf{PSQL}\) and \(\textsf{MSSQL}\) we can implicitly cast a string to an integer if the string is an integer, but not if the string is a real number. To fix this problem in \(\textsf{PSQL}\), we must explicitly cast the string to a real number: .
E4. If we take example 3, and change for a real number, such as , then the result is now 2.2 for every engine. The difference with respect to the previous example is that the plus operation is defined for both integers and real numbers, and now can be cast directly to a real number.
E5.\(\textsf{PSQL}\) reports an static error due to the lack of an addition operator between two strings. \(\textsf{MSSQL}\) returns as addition is overloaded for string. The other engines return 2 as addition is only defined between numbers, thus implicit casting to .
E6. Both \(\textsf{PSQL}\) and \(\textsf{Oracle}\) report an error, \(\textsf{MSSQL}\) uses string concatenation yielding , and \(\textsf{MySQL}\) and \(\textsf{SQLite}\) yield . The reason for the latter is due to the way these engines cast strings to numbers: they search for a number in the prefix of the string (if nothing is found then 0 is returned).
E7.\(\textsf{PSQL}\) rejects this query as column is of type String, and the addition between numbers and strings is not defined. This behavior is more conservative than E2, as now it cannot determine statically if the given string can be cast to number or not. Other engines defer the check to runtime and return . To fix this query in \(\textsf{PSQL}\) we can explicitly cast column to integer: .
E8.\(\textsf{PSQL}\) (statically) rejects this query similarly to E7. \(\textsf{MSSQL}\) and \(\textsf{Oracle}\) now fail dynamically as they cannot convert to a number. \(\textsf{MySQL}\) and \(\textsf{SQLite}\) on the other hand do not fail and return as is cast to . Casting to in \(\textsf{PSQL}\) would make the error dynamic, and a runtime type error would be reported instead, similarly to \(\textsf{MySQL}\) and \(\textsf{SQLite}\).
E9. Contrary to E2, \(\textsf{PSQL}\) raises a static error as the nested query hides the actual String returned by the subquery. All other engines yield as conversions are optimistically performed at runtime. To fix this query in \(\textsf{PSQL}\) an explicit cast must be inserted .
2.2 Boolean operations
The following examples illustrate difference in behavior related to comparison operators on conditionals. Note that 1 and 0 are used to represent true and false in \(\textsf{SQLite}\) and \(\textsf{MySQL}\).
E10. Almost every engine is able to cast , returning 1. \(\textsf{SQLite}\), on the other hand, returns a empty result as the condition is . This is because \(\textsf{SQLite}\) does not perform implicit conversions at the boundaries of comparisons. \(\textsf{SQLite}\) has a type hierarchy where every string is bigger than any number.
E11. Now, if the left operand is a string representing a real, then \(\textsf{PSQL}\) and \(\textsf{MSSQL}\) return a type error. This is because the best type of the comparison operator is the one that takes two integers as argument (because 2 is an integer). As there is no direct implicit conversion between a string representing a real and an integer the query is rejected. \(\textsf{SQLite}\) still returns an empty result, and \(\textsf{MySQL}\) and \(\textsf{Oracle}\) return 1.
Comparison operator in\(\textsf{SQLite}\).
Table 2.
Behavior of the comparison operator in \(\textsf{SQLite}\)
Dummy
\(\textsf{SQLite}\) warrants special attention to illustrate unique cases related to the comparison operator, as exemplified in Table 2. Notably, operations and yields . This behavior is attributed to the fact that strings are considered larger than integers, as previously explained. However, when a cast is introduced the expressions now yields . For instance, yields . Since when the type of one operand is explicitly specified, an implicit conversion to that type is performed on the other operand.
2.3 Set operations
E12.\(\textsf{PSQL}\), \(\textsf{MSSQL}\) and \(\textsf{MySQL}\), yield . This is because, during intersection, two conditions are checked: (1) the number of columns of both subqueries must match, and (2) the types of the columns must also be consistent. To achieve the second condition, an implicit conversion from string to real is inserted in the left subquery (the other direction is forbidden). However, \(\textsf{Oracle}\) encounters a type error since the column types do not match. On the other hand, \(\textsf{SQLite}\) returns an empty result as is not the same as .
E13. Now as cannot be implicitly cast to an integer (the column type of the right subquery), this program is rejected by \(\textsf{PSQL}\), \(\textsf{MSSQL}\) and \(\textsf{Oracle}\). Both \(\textsf{MySQL}\) and \(\textsf{SQLite}\) return an empty result.
Having illustrated the various discrepancies between SQL engines in handling queries, it becomes evident that a more structured approach is necessary to fully understand and model these differences. To achieve this, we now turn to the formalism of \(\textsf{TRAF} \).
3 The (\(\textsf{TRAF} \)) formal framework
In this section we present the syntax, type system, cast insertion, dynamic semantics, and translation of a typed core fragment of relational algebra, supporting projections, selection, set operations, arithmetic and boolean operations, and implicit and explicit casts. The formalism captures common behavior between engines while providing leeway to model different concrete engines, and thus is parametrized by abstract operators (detailed and instantiated in Section 3.2).
Fig. 2.
Syntax of \(\textsf{TRAF} \). n, b, s denote respectively an integer number, a boolean value and a string. N is a name. R a relation.
3.1 Syntax
The syntax of Typed Relational Algebra Framework (\(\textsf{TRAF} \)) is presented in Figure 2. The formalization is inspired by the work of Guagliardo and Libkin [19], except that here we deal with typing instead of assuming a prior (unstudied) typing phase. Types play a central role in this work, because as we illustrated, typing discrepancies are a source of important behavioral differences between engines. This section first presents types and schemas, then values and expressions, and finally queries.
Types and Schemas There are two categories of types, value types\(\tau \) for expressions (and values), and relation types\(T\) for queries and tables (relations). For simplicity, we only consider reals \({\mathbb {R}}\), integers \({\mathbb {Z}}\), booleans \({\mathbb {B}}\) and strings \(\textsf{String}\). To avoid dealing with precision issues inherent in floating-point representations, we use the abstract type \({\mathbb {R}}\) to represent decimal numbers. In addition, we use the symbol ? for the unknown type, used by \(\textsf{PSQL}\) to type string literals [17], and to model flexible typing in \(\textsf{SQLite}\). A relation type \(T\) is an ordered list of pairs of column names and their corresponding value type. A schema \(\varGamma \) is a list of pairs of relation names and their relation type. We assume that both \(T\) and \(\varGamma \) do not contain duplicated column names and relation names, respectively (and thus behave as mappings). Intuitively, schemas represent types of databases.
Values and expressions We represent tables as bags of rows, where each row is a list of values. A simple value\(w\), which represent atomic data (integers, booleans, strings). Values \(v\) are either a simple value \(w\), or a simple value cast to the unknown type \(w {:}{:}\, ? \) (the latter is not used directly by programmers, and it is used exclusively in engines such as \(\textsf{SQLite}\)). There are three kinds of expressions: general, boolean and aliased. A general expression e, used in projections and selections, is either a column name \(N\) (e.g. \(\textsf{Name}\) or \(\textsf{Age}\)), a value \(v\), an arithmetic operation (for simplicity we only use \(+\)), or an explicit cast \(e {:}{:}\, \tau \). Boolean expressions \( \theta \) as usual are comparison operations or logical combinations of them.4 As a standard, an aliased expression is a slight extension of the classical renaming, allowing binding of names to expressions, instead of only to queries. For example, in , the expression gets a name that can be used in the outer query.
Queries A query is either a relation \(R\), a projection , a selection , or a set operation, that is, cross product (\(Q \times Q\)), intersection (\(Q \cap Q\)), union (\(Q \cup Q\)), difference (\(Q - Q\)), and the removal of duplication (\(\varepsilon {(Q)}\)). The only novelty is that a projection here is parametrized by a list of aliased expressions, instead of a list of names. This allows to model SQL queries like
as \(\pi _{({1}+{1})\texttt {as}{C}}(R)\).
Rows and Tables A table \(t \) is a bag of rows\(\{\!\!\{r_{i}\}\!\!\}\), where a row is an ordered list of values \(\overline{v_{i}}\).
3.2 Abstract operators
As mentioned in Section 1, certain key operators in \(\textsf{TRAF} \) are left abstract, as they depend on the specific engine being used.We mark with (*) the partial operators.
Bidirectional Implicit Cast(*). Operator \(bisconv(e_{1},{\tau _{1}},{e_{2}},{\tau _{2}}) = \tau _{3}\) determines the optimal implicit type cast for a set operator applied to two columns of (possibly) different types. The \( biconv \) operator takes as argument two expressions (\(e_{1}\), \(e_{2}\)) and their corresponding types (\(\tau _{1}\), \(\tau _{1}\)), returning a type. It either returns \(\tau _{1}\) or \(\tau _{1}\) by testing if \(e_{1}\) can be implicitly cast to \(\tau _{2}\), or if \(e_{2}\) can be implicitly cast to \(\tau _{1}\) respectively.
Explicit Cast(*). Operator attempts to cast value \(v\) to a value \(v'\) of type \(\tau \). This function is primarily used to evaluate casts at runtime.
Overloading Resolution(*). Operator \(resolve({e_{1}},{\tau _{1}},{e_{2}},{\tau _{2}}{\mathbin {\texttt{O}}}) = \tau _{3}\) determines which specific operation among a set of overloaded ones should be called based on the provided arguments during invocation. More specifically, givenan operator it tries to find the best candidate type for expressions \(e_{1}\) and \(e_{2}\), typed as \(\tau _{1}\) and \(\tau _{2}\) respectively.
Explicit Cast Feasibility. This operator takes one expression and two types, and returns a boolean. For simplicity, it is presented as a relation , and rules out explicit casts that are known to fail at runtime. It tests if it is possible to explicitly cast expression \(e\) of type \(\tau '\), to an expression of type \(\tau \).
Type of Values. Operator \(ty({v}) = \tau \) computes the type of values (constants).
Type Cleaning. The operator \(clean({T}) = T'\) performs post-processing on the relation type \(T\), returning a new relation type \(T'\). This is primarly used in \(\textsf{PSQL}\), where literal strings are initially typed as \(?\), a type that is later converted to \(\textsf{String}\) when determining the type of a subquery.
Annotation Insertion.The operator \(inserts({e},{\tau },{\mathbin {\texttt{O}}}) = e'\) returns either an explicitly cast expression \(e\) to type \(\tau \) or simply \(e\), depending on the operation \(\mathbin {\texttt{O}}\). For instance, in SQLite, comparison operations do not implicitly cast their operands, whereas addition operations do perform such casts.
Value Operation Application. Operator \(apply({\mathbin {\texttt{O}}},{v_{1}},{v_{2}}) = v_{3}\) performs arithmetic or comparison operation \(\mathbin {\texttt{O}}\) to values \(v_{1}\) and \(v_{2}\), yielding value \(v_{3}\).
In the next subsections, we use these operators to define the dynamic semantics, the type system and the cast insertion procedure. Examples of instantiation of these abstract operators can be found in Section 4.
Fig. 3.
Type System of \(\textsf{TRAF} \). Abstract operators are highlighted in gray.
3.3 Type System
The type system of \(\textsf{TRAF} \) is presented in Figure 3, and in the following, we briefly explain the rationale behind each rule. Boxes in gray indicate abstract operators, and we can provide specific implementations for modeling an engine (Section 3.2).
Expressions. Rule (T\(v\)) assigns types to values based on the specific database engine (the \( ty \) operator in the grey box). For instance, in \(\textsf{PSQL}\), integers are typed as \({\mathbb {Z}}\) and literal strings as \(?\), but in \(\textsf{SQLite}\), both are typed as \(?\). Rule (T\(N\)) assigns the type \(\tau \) to the column name \(N\) if \(N\) is mapped to \(\tau \) in the relation type \(T\).
Rule (T : : ) assigns the type \(\tau \) to an ascription \(e {:}{:}\, \tau \) if the following conditions are met: (1) the expression \(e\) must be well-typed for some \(\tau '\); (2) the explicit cast of \(e\) to \(\tau \) (engine-dependent, thus grey box) is checked to rule out explicit casts that are known to always fail at runtime. For example, the attempt to cast to \({\mathbb {Z}}\) ( ) is rejected in \(\textsf{PSQL}\), while casting to \({\mathbb {Z}}\) is accepted in both \(\textsf{PSQL}\) and \(\textsf{SQLite}\).
Rules (T\(\mathbin {\texttt{O}}\)), (T\(\mathbin {\mathtt {O_{B}}}\)) and (T\(\lnot \)) type operations. (T\(\mathbin {\mathtt {O_{B}}}\)) and (T\(\lnot \)) type boolean operations as usual. The interesting case is rule (T\(\mathbin {\texttt{O}}\)) that types arithmetic and string operations. Based on the types of \(e_{1}\) and \(e_{2}\) and operation \(\mathbin {\texttt{O}}\), the (T\(\mathbin {\texttt{O}}\)) rule searches for the best candidate type signature for the given operation, taking into consideration that an operation might be overloaded with multiple types. It involves the operations \( resolve \) and \( ty \) that are engine-dependent. If a single candidate function type is identified , the expression is typed; otherwise, it is considered ill-typed. Note that types \(\tau _{3}\) and \(\tau _{4}\) do not need to coincide with \(\tau _{1}\) and \(\tau _{2}\) as arguments can be cast to different types. For instance, in the case of the query both \(\textsf{PSQL}\) and \(\textsf{SQLite}\) choose numeric addition (\({\mathbb {Z}}\times {\mathbb {Z}}-> {\mathbb {Z}}\)). However, when dealing with strings (e.g. ) \(\textsf{PSQL}\) rejects the query because it cannot choose a best candidate operator, but \(\textsf{SQLite}\) instantiates the query with numeric addition, implicitly casting both string arguments to integers.
Rule (T\(\beta \)) is used to type an aliased expression \(\beta \). We use the notation \(\overline{A_i}\) to denote a list \(A_1, \dots , A_n\). Under \(T\), every expression \(e_{i}\) yields a type \(\tau _{i}\). Subsequently, the type of the aliased expression as a whole is a relation type, where each name \(N_{i}\) is mapped to the type \(\tau _{i}\) of their corresponding subexpression \(e_{i}\). Additionally, we use metafunction \( unique (.)\) to ensure that names are unique.
Queries. Rule (T\(R\)) assigns a relation type to relation name \(R\) according to schema environment \(\varGamma \). Rule (T\(\pi \)) types \(\pi _{\beta }(Q)\) based on the type of Q and that of the aliased expression \(\beta \), which is typed under \(clean({T})\) to remove unknown occurrences. For instance, in \(\textsf{PSQL}\), runs successfully, but is rejected statically: the first query has type unknown \(?\), which can be cast to integer, whereas in the second query, the unknown type \(?\) is transformed to \(\textsf{String}\) disallowing the implicit cast to integer.
Rule (T\(\sigma \)) first typechecks the subquery, resulting in a relation type \(T\). Then, the condition must be successfully typed as boolean under the context of \(T\) (considering that the condition may reference columns from the subquery). Finally, the selection operation is assigned the same type as the subquery \(T\). Rule (T\(\times \)) types cross products using the concatenation of the relation types of both subqueries, ensuring that the sets of names in each subquery are disjoint. The list of column names of a query \(Q\) is extracted using function , and defined as:
Rule (*) follows standard engine usage of using the left schema of a set expression as output schema.
Rule (T\(\mathbin {\mathtt {O_{S}}}\)) deals with set operations (union, intersection, and difference) which require special attention.Usually, it is assumed that the names, number of columns, and types of the columns in the subqueries match. In practice, the column names do not necessarily match, but number of columns and types must align. To achieve this, many engines perform implicit casts between the columns of the subqueries to align their types. In particular, string literals are analyzed to check the plausibility of casts. For instance in \(\textsf{PSQL}\), runs successfully, resulting in a non-empty result, whereas is rejected before execution. To deal with these special cases, and without loss of generality, we require that both subqueries within a set operation must be projections in order to verify column casts. Therefore, the rule first typechecks both projections. Second, it examines whether the lists of aliased expressions can be implicitly cast between each other, taking their relation types into account. Finally, if this cast is feasible, the target relation type is captured and used to typecheck the whole set operation. To check casts between lists of aliased expressions, we use the \( biconv ^{*}\) operation defined as follows:
This operation returns a relation type, where each name \(N\) is mapped to the application of abstract operator \( biconv \) over each pair of expressions and their types. We select \(N\), the name of the left subquery, as it is a more commonly-adopted practice in various database engines. This partial operator determines the optimal implicit type cast for a set operator applied to two columns of (possibly) different types, returning one of the two types as a result. The expressions are provided to the function to rule out casts that are known to always fail during evaluation. For instance, \(\textsf{PSQL}\) accepts query , as both and can be cast to integers; but rejects
as cannot be implicitly cast to an integer (note that in \(\textsf{PSQL}\), implicit casts from integers to strings are not allowed).
3.4 Cast Insertion
Recall from Figure 1 that to avoid the complexity of dealing with implicit casts during runtime, before execution, in \(\textsf{TRAF} \) we transform each implicit cast to an explicit cast. For instance, a \(\textsf{PSQL}\) query is transformed to , which in \(\textsf{TRAF} \) corresponds to the elaboration from to .
Fig. 4.
\(\textsf{TRAF} \) Cast Insertion (excerpt). Abstract operators are highlighted in gray.
Figure 4 presents an excerpt of the explicit cast insertion rules; the complete rules can be found in theextended version. The rules are type directed, meaning that (1) we only elaborate well-typed terms, and (2) we use type information during elaboration. Like for typing, elaboration rules are defined inductively and grouped in three categories: for general and aliased expressions, and for queries. Most elaboration rules directly follow their corresponding typing rule. Judgment \(T |- e : \tau \leadsto e'\) represents that expression \(e\) typed as \(\tau \) under relation type \(T\) is elaborated to \(e'\). Judgment \(\varGamma |- Q : T \leadsto Q'\) is defined analogously.
Rule (E\(v\)) inserts an explicit cast to the type of each value. This is especially relevant for engines like \(\textsf{SQLite}\) where some values need to be tagged as unknown.
In \(\textsf{SQLite}\), all constants are considered to be of type unknown, unless the constant is fetched from a table. This is important at runtime, as a might behave differently than .
For example, the expression evaluates to 0. To facilitate this special case, the expression is elaborated to .
Rule (E\(\mathbin {\texttt{O}}\)) introduces explicit casts based on the operation and the best candidate type. Specifically, we insert casts for both operands to match their expected corresponding domain types and also insert a cast in the result of the operation to the expected codomain type. Certain engines, like \(\textsf{SQLite}\), do not insert explicit casts for comparison operations. To achieve this, we use the \( insert \) abstract operator.
Rule (E\(\mathbin {\mathtt {O_{S}}}\)) elaborates set operations by inserting casts to the output type of \( biconv ^{*}\). This is done using the \( insert ^{*}\) operation defined as
.For instance, for \(\textsf{PSQL}\), query is elaborated to .
Fig. 5.
Dynamic Semantics of \(\textsf{TRAF} \). Abstract operators are highlighted in gray.
3.5 Dynamic Semantics
Figure 5 presents the dynamic semantics of \(\textsf{TRAF} \), which differs from the ones of Guagliardo and Libkin [19] (GL) as follows. First, \(\textsf{TRAF} \) dynamic semantics are defined over a relational algebra, whereas GL uses SQL syntax to support extra features. Second, and more importantly, as some casts may be invalid, the dynamic semantics must deal with the possibility of runtime errors. For instance, in \(\textsf{PSQL}\) the query evaluates to an error.
To concisely account for the possibility of runtime errors, we present the dynamic semantics in monadic style in order to streamline the handling of errors [37].
The monadic presentation of computations that may fail consists in so-called “optional” values: either an actual value tagged with \({\textbf {ok}} ~\!\), or \({\textbf {error}} \) to denote an error. The sequential composition is given in a \(\textsf{do}\) block (e.g. ). Errors are transparently propagated through such sequences: the evaluation returns \({\textbf {ok}} ~v\) if all steps in a \(\textsf{do}\) block (e.g. A, B, and C) evaluate successfully, or \({\textbf {error}} \) if one step in the sequence evaluates to \({\textbf {error}} \).
There are four categories of evaluations : \(\llbracket Q \rrbracket _{\texttt{D},\varGamma }\) to reduce queries, to reduce expressions, to reduce aliased expressions, and to reduce boolean expressions. The evaluation of a query is parametrized by a database \(\texttt{D}\), which maps relation names \(R\) to tables \(t \). We use to denote that appears k times in \(t \). The other categories of evaluation are parametrized by an environment , which maps column names \(N\) to values \(v\). Intuitively, this environment is used to extract the value associated with a column name for a given row. For instance, consider a relation of persons , and a table . Then, for the first row, and .
Queries. The basic case is Rule (RR), which evaluates the name of a relation yielding the table associated to that name in database \(\texttt{D}\). Rule (R\(\mathbin {\mathtt {O_{S}}}\)) evaluates set operations by first reducing the subqueries, then combining the resulting tables. Rule (R\(\pi \)) starts by evaluating subquery \(Q\). If the result is successful (a table \(t \)), then for each row that appears k times in \(t \), we try to project columns as dictated by \(\beta \), and duplicate the result k times. To do this, we evaluate \(\beta \) under environment (similarly to [19], and ). This environment is formed by matching corresponding column names of \(Q\) with the values of . Finally, as the evaluation of aliased expressions can also produce errors, we lift the bag of optional rows \(\texttt{S}\) to an optional bag of rows using the \(\lceil \cdot \rceil \) function defined as: \(\lceil \texttt{S} \rceil = {\textbf {error}} \), when \({\textbf {error}} \in \texttt{S}\); and otherwise.
Rule (R\(\sigma \)) follows a similar approach by first reducing the subquery \(Q\). If the result is successful (yielding table \(t \)), we proceed to test the reduction of the condition \(\theta \) for each row. We employ a strategy akin to that of (R\(\pi \)), but in this case, the resulting bag of booleans is unused (binding the resulting in variable “_”). This way, in the third instruction, we can be confident that the evaluation of the conditions does not result in errors. Finally, we filter the rows from table \(t \) that satisfy the given condition. The last rule for queries (R\(\varepsilon \)) removes duplicates from subquery using the function \(\varepsilon (\cdot )\).
Expressions. Rule (Ras) evaluates a single aliased expression by evaluating the subexpression \(e\) and disregarding the name \(N\). Rule (R\(\beta \)) applies when we are evaluating multiple aliased expressions. Initially, it recursively reduces the sublist to a row r, and then the head of the list to a value \(v\), resulting in a new row \(r,v\).
Rule (R\(N\)) successfully evaluates a column name \(N\) to its corresponding value in . Rule (R\(v\)) successfully evaluates values to themselves. Rule (R\(v{:}{:}\)) attempts to cast a value into a value of a different type using the function. For instance, in \(\textsf{SQLite}\), the expression evaluates to , whereas in \(\textsf{PSQL}\) is not defined. If the function is defined for the given value and type, the resulting value is returned; otherwise, an error is raised. Rule (R\(e\)::) applies to subexpressions that are not already values. It first reduces the subexpression to a value, and then casts the value using rule (R\(v\)::).
Rules (R\(\mathbin {\mathtt {O_{B}}}\)), (R\(\lnot \)), and (R\(\mathbin {\texttt{O}}\)) operate in a similar fashion. First, each subexpression is reduced, then the resulting values are combined using the specific operation at hand. For arithmetic and comparison operations, the exact operation is performed using the apply operation. For instance, in \(\textsf{PSQL}\), the expression evaluates to true, whereas in \(\textsf{SQLite}\), it yields false.
4 Instantiating \(\textsf{TRAF} \)
In this section we illustrate how to instantiate \(\textsf{TRAF} \) for \(\textsf{PSQL}\) and \(\textsf{SQLite}\). The complete rules and the instantiations for three other engines (\(\textsf{MSSQL}\), \(\textsf{Oracle}\), and \(\textsf{MySQL}\)) can be found in theextended version. In general, an instantiation is achieved by providing specific definitions of the abstract operators that are engine dependent (the grey boxes in the figures). We obtained these definitions by exploring the documentation of each engine, and by conducting black-box analyses interacting with each engine whenever the documentation was lacking in details.
4.1 \(\textsf{TRAF}/\textsf{PSQL}\)
Fig. 6.
The \(\textsf{TRAF}/\textsf{PSQL}\) Instantiation (excerpt)
Figure 6 describes the \(\textsf{TRAF}/\textsf{PSQL}\) instantiation. \(\textsf{PSQL}\) is characterized by being a strongly-typed SQL engine, meaning that it is more conservative than the rest. In many cases, if it cannot check the feasibility of casts, it rejects the query before its execution.
For the type of values, reals are typed as \({\mathbb {R}}\), integers to \({\mathbb {Z}}\), and string literals to \(?\). The operator for removing unknown types replaces ? occurrences with \(\textsf{String}\), while leaving other types unchanged.5 The operator for inserting annotations adds explicit casts regardless of the operation’s type. The candidate types of a given operator is represented as sets of binary function types. Lastly, the semantics of operations between values are passed to the real implementation without any modifications.
Bidirectional implicit cast. Operator \( biconv \) is defined using the auxiliary implicit type cast relation \(e: \tau \leadsto \tau ' \). An expression \(e\) of type \(\tau \) can be cast to \(\tau '\) if either type \(\tau \) can be implicitly cast to \(\tau '\) (\(\tau \Rightarrow \tau '\)), or if \(e\) is a value \(v\) of type unknown \(?\) (i.e. a string) and that value can be implicitly cast to some value \(v'\) under type \(\tau '\) ( ).
Implicit type cast \(\tau \Rightarrow \tau '\) is only defined between identical types \(\tau \Rightarrow \tau \), or from integers to reals \({\mathbb {Z}} \Rightarrow {\mathbb {R}}\). For instance, is accepted and evaluates to \(\{1.0\}\), since \({\mathbb {Z}}\) (the type of ) can be implicitly cast to a \({\mathbb {R}}\) (the type of ). Implicit value cast is defined for extracting numbers from strings, but not viceversa.
Explicit cast. In addition to what an implicitvalue cast can do, explicit(value) casts support casts from real numbers to integers by removing the decimals, and from non-string values to strings by enclosing them in quotes. For instance, is accepted and evaluates to 1.1, but is rejected by the typechecker.
Overload resolution. The definition of the \( resolve \) operator is divided in four cases. The general case arises when the types of the two expressions are not known. Function bestCandidate is used to determine the best candidate. We model this function by initially calculating the sum of the type differences between the corresponding types of the expression and the domain of each candidate. The type difference \(\tau - \tau '\) quantifies the “cost” of changing type \(\tau \) to \(\tau '\): it yields a value of 0 if both types are identical, 1 when transitioning from an integer to a real or from a string to a number, and 2 in all other cases. If there are ties, and more than one type is selected, then the function is not defined. Furthermore, both types must satisfy the implicit cast relation with their corresponding type from the domain of the chosen best candidate. For instance, for query , the operation has two possible candidates: \({\mathbb {Z}} \times {\mathbb {Z}} \rightarrow {\mathbb {Z}}\), and \({\mathbb {R}} \times {\mathbb {R}} \rightarrow {\mathbb {R}}\). The best candidate in this case is \({\mathbb {R}} \times {\mathbb {R}} \rightarrow {\mathbb {R}}\), as both operands can be implicitly cast to reals. The remaining cases are analogous. When only one type is unknown, the best candidate function is applied using the other known type in both positions. Additionally, the expression of unknown type is implicitly cast to the chosen type to rule out potential errors. Finally, if both types are unknown, they are assumed to be strings when looking for the best candidate. For example, in , the best candidate type is \({\mathbb {R}} \times {\mathbb {R}} \rightarrow {\mathbb {R}}\), as the implicit cast from to real is possible. On the contrary, query is rejected by the typechecker because there is more than one candidate available (int and real versions).
Fig. 7.
The \(\textsf{TRAF}/\textsf{SQLite}\) Instantiation (excerpt)
4.2 \(\textsf{TRAF}/\textsf{SQLite}\)
\(\textsf{SQLite}\) is one of the most flexible SQL engines: type enforcement is not mandatory, and types on columns are optional. However, \(\textsf{SQLite}\) attempts its best effort to perform casts without ever raising a type error at runtime. To capture this kind of flexibility, we model the type of values as \(?\) and define all abstract operators as total functions. Figure 7 describes the \(\textsf{TRAF}/\textsf{SQLite}\) instantiation.
Statically, in \(\textsf{SQLite}\) the type of every value is unknown. We introduce a dynamic type operator \(dty({v})\) to obtain precise type information during the evaluation of comparisons. The type of a value, initially unknown, may be refined at runtime to a more precise type. The operator for removing unknown types clean acts as the identity function, since unknown values have special meaning. For instance is true, but is false. The operator for inserting annotations only inserts explicit casts for arithmetic operations, while for comparison operations, it behaves as the identity function.
The dynamic semantics for comparison is more involved. First, the best candidate type is searched, using the dynamic type information of the operands. Then, once the best candidate is found, both operands are implicitly cast to the corresponding types. Finally, the cast values, stripped of (potentially) casts to unknown, are compared using the \( compare (w_{1},{w_{2}},\)) function defined as 1 if \((dty({{w_{1}}} ~)~ = {dty({{w_{2}}})} \wedge w_{1} < w_{2}) \vee (dty{ {(w_{1})} } ~ < ~ dty{ {(w_{2}}) })\); 0 otherwise. If the dynamic types of the operands are equal, then a regular comparison operation is performed. If the types are different then the types are compared using an arbitrary hierarchy such that \({\mathbb {Z}}= {\mathbb {R}}< \textsf{String}\).
To illustrate, consider examples (1) , and (2) . The first example evaluates to . Since the (E\(v\)) elaboration rule inserts an explicit cast on every value, both values are cast to \(?\). Consequently, the chosen candidate for is \(? \times ? \rightarrow ?\), and the implicit cast leaves them untouched. Finally, as the dynamic type of both operands, with annotations removed, is \(\textsf{String}\) and \({\mathbb {R}}\) respectively, the comparison function yields as result. The second example evaluates to . In the process of elaboration, the left expression is cast to the unknown type, while the right expression is cast to an integer type. During the actual evaluation, the most suitable candidate is determined to be \({\mathbb {R}} \times {\mathbb {R}} \rightarrow ?\), which implies an implicit cast of both values into numerical values and , respectively. As both casted values share the same dynamic type, a standard comparison is carried out, resulting in .
Bidirectional implicit cast. For the case of \(\textsf{SQLite}\), the operator \( biconv \) always yields the unknown type for any pair of types and expressions. Here the relation is always defined. Consequently, for implicit casts from strings to numbers, when the string is not a valid number, the result is the same string.
Explicit cast feasibility. Operator allows casting any expression of type \(\tau \) to any type \(\tau '\).
Explicit cast. This operator is defined almost identically to implicit casts, except when the expression is a string. In this case, the cast is performed by extracting the largest numeric prefix from the string and then casting it to the required number type. If there is no numeric prefix, then the cast yields 0. For instance, evaluates to (12, 0).
Overload resolution. There is only one rule for overloading resolution \( resolve \). It yields the first best candidate found, using the type difference operator. Type difference yields 0 whenthe types are the same or when converting from integer to real, and 1 when converting either from an unknown type to any other typeor from string to real.
Given the design of \(\textsf{TRAF}/\textsf{PSQL}\) and \(\textsf{TRAF}/\textsf{SQLite}\), we now illustrate some examples of SQL query translations between their corresponding database engines. We show the effect of understanding the type semantics of each engine to justify translations that might seem counterintuitive.
From\(\textsf{SQLite}\)to\(\textsf{PSQL}\). Consider example E3., . This example runs successfully in \(\textsf{SQLite}\) and yields . This is expected due to the candidates , , and
However, this query does not typecheck in \(\textsf{PSQL}\). Addition is only defined for numeric values ( ), and requires to be defined (which it is not). What is defined is the implicit cast to a real . We can force such cast with the following translation that preserves the same behavior: .
Other examples such as E10, , are more challenging. In \(\textsf{SQLite}\), this query yields an empty result because strings are considered larger than numbers. A straightforward translation to \(\textsf{PSQL}\) that maintains this behavior is . However, a more general approach involves following the \( compare \) function, performing type testing using and then applying dynamic casts accordingly. For instance, the comparison could be translated to:
However, dynamic casts such as are not supported natively by \(\textsf{PSQL}\).
From\(\textsf{PSQL}\) to \(\textsf{SQLite}\) Consider once again example E10, , but now in the opposite direction. In \(\textsf{PSQL}\), this query returns (the best candidate type is \({\mathbb {Z}} \times {\mathbb {Z}} \rightarrow {\mathbb {B}}\)). Translating this query to \(\textsf{SQLite}\) requires mimicking the comparison behavior of \(\textsf{PSQL}\). According to the \( compare \) function, both arguments need to have the same dynamic type. This can be achieved either by casting the left operand: , or surprisingly, by casting the right operand: . By casting the right operand, \( resolve \) chooses \({\mathbb {R}} \times {\mathbb {R}} \rightarrow ?\) as best candidate, thus implicitly casting both operands to a number.
Future work may involve an automatic translation mechanism, which takes into account their type semantic and operational differences. Such mechanism would significantly reduce the manual effort required to adapt queries and help ensure consistency.
5 Properties
Based on the core relational algebra of \(\textsf{TRAF} \), we can establish metatheoretical results for eachformalized engine, consisting of lemmas and theorems regarding queries and their evaluation. For simplicity, we will refer to each engine by the name of its corresponding TRAF formalization. Specifically, we can articulate the formal distinctions among various engines, and pinpoints the exact requirements that abstract operator (e.g. for new engines) must meet to satisfy the properties. This aids us in better comprehending the process of transforming queries from one engine to another.
To state these theorems, we need several new definitions, in particular a way to typecheck rows r, tables t and databases \(\texttt{D}\):
A row is well-typed if every value is typed to its corresponding type in \(T\). A table is typed \(T\) if every row is typed as \(T\). An empty row is typed to any relation type \(T\). A database is typed \(\varGamma \), if every relation in \(\texttt{D}\) is typed to its corresponding type in \(\varGamma \). Note that these rules are non-deterministic, so any value can be associated to any name. The proofs require some properties about the implementation of abstract operators shown in Figure 8.
Fig. 8.
Requirements that abstract operator must meet to satisfy the properties.
Assuming R1, R2 and R3, any instantiation of \(\textsf{TRAF} \) is type safe, meaning that well-typed queries either reduce to a table or raise a controlled type error, i.e., an error captured and raised by the language upon detecting an inconsistency. In other words, the evaluation of well-typed queries doesnot raise uncontrolled errors, such as getting stuck. For instance, an ill-type query, or , where is not defined in , gets stuck as does not evaluate further.
Assuming also R4, a stronger theorem called type soundness states that, in addition to type safety, the resulting table indeed has the type of the query.
The use of \(clean({\cdot })\) is exclusively for \(\textsf{PSQL}\), and for cases such as . This query of type \(A\mapsto \textsf{String}\) evaluates to , typed as \(A\mapsto ?\) (literal strings are typed \(?\)), but \(clean({\textsf{String}}) = clean({?}) = \textsf{String}\).
Type safety is satisfied by every engine we consider, but type soundness is satisfied by all except \(\textsf{SQLite}\). This is because \(\textsf{SQLite}\) does not satisfy R4. For instance, in \(\textsf{SQLite}\), has type \(A\mapsto {\mathbb {Z}}\), but its evaluation is typed \(A\mapsto ?\). Also, \(\textsf{SQLite}\) permits storing string values in integer columns.
Moreover, \(\textsf{PSQL}\), \(\textsf{MySQL}\) and \(\textsf{SQLite}\) satisfy a theorem that states that if the programmer does not use any explicit cast in a well-typed query, then the query evaluates without error. To state this theorem we use the cast-free metafunction \(\texttt{CF}(Q)\), which is defined when \(Q\) does not have explicit casts of the form \(e {:}{:}\, \tau \) (definition in theextended version).
Theorem 3
(Cast-free queries do not fail). If R1, R5, R6, R9 and R10 hold, \(\texttt{CF}(Q)\), \( |- \texttt{D} : \varGamma \) and \(\varGamma |- Q : T \leadsto Q'\) then \(\llbracket Q' \rrbracket _{\texttt{D},\varGamma } \ne {\textbf {error}} \).
Neither \(\textsf{MSSQL}\) nor \(\textsf{Oracle}\) satisfy this property. Specifically, \(\textsf{MSSQL}\) does not satisfy R5 and R6, and \(\textsf{Oracle}\) does not satisfy R5. To illustrate why, let us consider table , schema \(R \mapsto {(A\mapsto \textsf{String})}\), and query . This query does not typecheck in PSQL, and evaluates successfully in other engines. But with one more row to R: , the same query evaluates to a runtime error in \(\textsf{MSSQL}\) and \(\textsf{Oracle}\).
Regarding cast insertion, the type of query translation, and the translation is unique:
Theorem 4
(Cast insertion is a type-preserving function). If R1, R2, R4, R7, R8 and R9 hold, and \(\varGamma |- Q : T\) then there exists a unique \(Q'\) such that \(\varGamma |- Q : T \leadsto Q'\) and \(\varGamma |- Q' : T\).
This theorem is satisfied by the five engines we studied.
6 Experimental Validation
To validate the adequacy of the formalism and its instantiations, we adopt approaches similar to those used for the validation of formal models of Python [30], JavaScript [29], and more closely related to our work, SQL [19], by testing against real-world implementations. We develop \(\textsf{PyTRAF} \), an implementation of \(\textsf{TRAF} \) in Python, and create one instance for each of five engines. We generate multiple random queries and verify that the results from the actual engine match those obtained from the prototype.
We generate a total of 100, 000 random SQL queries for each engine, successfully confirming that our design aligns with the behavior of each individual engine. This process is challenging when dealing with engine-specific query optimizations. In particular, sometimes \(\textsf{PyTRAF} \) reports an error while the engine returns a table. This discrepancy occurs due to avoidance of executing certain subexpressions or subqueries that are prone to failure. For this reason, we divided the validation in two categories: a termination-insensitive validation, and a termination-sensitive validation6.
The termination-insensitive validation approach involves verifying that, if the evaluation of a query in \(\textsf{PyTRAF} \) and in the engine result in tables, then these tables must be equivalent. In \(\textsf{PyTRAF} \), the query generation is parameterized by the engine due to subtle discrepancies between engines For instance, \(\textsf{MSSQL}\) and \(\textsf{Oracle}\) lack a boolean type and represent booleans using integers. To avoid floating number precision mismatches, real numbers are represented as decimals in both \(\textsf{PyTRAF} \) and the engines. Note that comparing real numbers using a notion of closeness might be feasible, but it presents a greater challenge when these results are then cast to strings. Finally, in \(\textsf{MySQL}\), we had to cast some operands of arithmetic operators to decimal to avoid precision issues. In addition, we check whether a query that succeeds in \(\textsf{PyTRAF} \) will also succeed in the real engine. We have observed that this is true for \(\textsf{MySQL}\), \(\textsf{MSSQL}\) and \(\textsf{SQLite}\), but not for \(\textsf{PSQL}\) and \(\textsf{Oracle}\). The reason for this discrepancy is that some engines perform optimizations that affect the evaluation order, eagerly casting aliased subexpressions in subqueries whose condition is always false, leading to unsound results in the presence of effects such as cast errors. In other words, the optimizations performed by the engines are sound only for “pure” queries (those that do not fail). However, the impact of these optimizations on erroneous queries appears to be overlooked by engine providers.
The termination sensitive validation approach is a stronger result. It involves verifying that, if the evaluation of a query in \(\textsf{PyTRAF} \) yields a table, then evaluation of the query in the engine results in an equivalent table. Furthermore, if a query in \(\textsf{PyTRAF} \) reports an error, then the query in the engine also reports an error. It is important to note that sometimes distinguishing between errors resulting from type checking or evaluation solely by inspecting the engine’s output might not be feasible. Consequently, if \(\textsf{PyTRAF} \) reports a type error (either statically or dynamically), we verify that the real engine throws any kind of errors. To achieve this stronger validation, we had to perform several simplifications (explained in the extended version) on the generation of queries because some query optimizations prevent certain sub-expressions from being evaluated.
7 Related Work
Traditionally SQL has been implemented with some sort of either static typing or syntactic checking, though the issue of type errors and type disciplines has received little attention. Nonetheless, there are two lines of works that relate to this work. In the Databases literature, the consideration of corner cases such as NULLs and dynamically generated queries involves typing issues. Also, some engines, like \(\textsf{SQLite}\) have “flexible” type systems. In Programming Languages, type systems are a central topic, but SQL and databases have received little attention. Both areas have been sources of inspiration and techniques for our work.
Classical database literature. There are many works formalizing SQL [8‐10, 27]. Guagliardo and Libkin [19] developed a comprehensive formal semantics for SQL whose core we follow here. Following the classic framework in the area, they assume that all comparisons and operations are applied to arguments of the right types. Therefore essentially they do not deal with typing issues. Regarding errors in SQL, based on previous work [2, 33, 39], Taipalus et al. [35] review SQL errors to build a unified error categorization. In further work, Taipalus et al. [34] compares the error messages of the four most popular relational database management systems (MySQL, Oracle, PostgreSQL, and SQL Server) in terms of error message effectiveness, effects, and usefulness, and error recovery confidence. Our work does not deal with error messages, but instead with detecting errors. Finally, regarding formalization, Benzaken et al. Benzaken and Contejean [4] provide a Coq mechanised, executable, formal semantics for a realistic fragment of SQL. Their coq formalization covers null values, functions, aggregates, quantifiers and nested potentially correlated sub-queries. Ricciotti and Cheney [31] complement and deepens the work of Guagliardo and Libkin [19] by making the notions of their semantics and proof precise and formal using Coq. [5] propose the first mechanically verified compiler (DBCert, using Coq) for SQL queries. These works assume the precise matching of types and therefore there are no implicit casts.
Flexible typing databases. A distinctive example is SQLite, the most widely deployed database engine [21], which enjoys flexible typing. Data of any type may be stored in any column of an SQLite table (except an INTEGER PRIMARY KEY column, in which case the data must be integral) and columns can be declared without any data type [12]. SQL queries are viewed as strings and little error checking is done for dynamically-generated SQL query strings. Wassermann et al. [38] propose a static program analysis technique to verify that dynamically-generated query strings do not contain type errors. Similar to \(\textsf{TRAF} \), they employ a type system to reject invalid dynamically-generated queries. However, their focus does not lie in the formalism of dynamic semantics or casts, and the evaluation uses the grammar of \(\textsf{Oracle}\).
Strongly-typed queries. The development of programming language libraries and tools for type-checking queries has been extensively explored [3, 13‐16, 24‐26, 32]. However, the formalization of implicit and explicit type casts has not been addressed. Additionally, these studies lack a practical exploration of the varied behaviors induced by typing in industry-standard database engines. From a formal perspective, significant progress has been made in the area of type inference for relational algebra [7, 28, 36] and SQL [11, 23]. In \(\textsf{TRAF} \), we presume the existence of a typed schema, and consider the definition of type inference as an area for future exploration.
8 Conclusion
In this paper, we identify some discrepancies in behavior regarding the handling of types both statically and dynamically in current SQL engines. This presents practical problems (e.g. when porting queries among engines) that are challenging to address.
We demonstrated that addressing this issue is feasible by integrating a light-weight typing system. Indeed, we present \(\textsf{TRAF} \), a formal framework for a typed relational algebra with support for implicit and explicit type casts. \(\textsf{TRAF} \) permits to formally understand the behavior of different database engines; we validate this expressiveness by providing five different instantiations.
Our framework highlights the necessary requirements for any concrete instantiation to satisfy formal properties such as type safety and soundness, among others. The typing discrepancies addressed shed light that certain apparently minor design decisions of engines may lead to major changes in behavior. As future work, we believe this initial step that constitutes our work should be extended to deal with discrepancies under query optimizations performed by many practical engines.
It would also be valuable to develop a technique that standardizes query behavior according to a specific database semantics model, by inserting sufficient casts or other forms of disambiguation so that the query runs correctly across different databases. Additionally, we could extend the scope to encompass the casting behavior of various types; for example, how one system handles casting an integer or float to a less precise decimal type may differ from the approach of another system.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
A good example is the following: To cast an exact number to an exact numeric type, e.g. from a real to int, the specification says: "If there is a representation of SV [source value] in the data type TV [target value] that does not lose any leading significant digits after rounding or truncating if necessary, then TV is that representation. The choice of whether to round or truncate is implementation-defined." ANSI SQL 1992, Sec. 6.10, Case 3)a)i) )
Note that we do not include boolean expressions as general expressions, because some engines do not support selecting boolean values in queries (e.g. in \(\textsf{MSSQL}\) and \(\textsf{Oracle}\),
is a syntactically invalid query while it is valid in \(\textsf{PSQL}\), \(\textsf{MySQL}\) and \(\textsf{SQLite}\)).
The names "termination sensitive" (TS) and "termination insensitive" (TI) are borrowed from hyper-properties such as noninterference (NI). In NI, TI-NI means that NI holds only when both executions terminate successfully, while TS-NI means TI-NI plus equitermination. Therefore, TI validation means that the engine and model coincide whenever they both don’t fail, while TS validation means that if one fails, the other must fail as well.