1 Introduction
Category | Languages | Data models | Description |
---|---|---|---|
SQL Extensions | SQL/XML | Relation, XML | A part of SQL:2003; Examples: PL/SQL, T-SQL, PL/pgSQL, Db2, MySQL, etc |
SQL/JSON | Relation, JSON | A part of SQL:2016; Examples: PL/SQL, T-SQL, PL/pgSQL, Db2, MySQL, etc | |
SQL++ | Relation, JSON | Extend SQL with path expressions; Examples: AsterixDB’s SQL++, Couchbase’s N1QL | |
SQL/SPARQL | Relation, RDF | Non-standard extensions of SQL; Examples: Oracle’s SQL/SPARQL, Virtuoso’s SPASQL | |
SQL/PGQ | Relation, property graph | A planned SQL extension for property graph; not implemented yet | |
GQL | Relation, property graph | The latest planned standard for property graph; not implemented yet | |
XPath/XQuery Extensions | Core-XPath | XML, JSON | Navigational capability of XPath; not implemented yet |
\({\textbf {GXPath}}_{reg}\) | XML, JSON, graph | XPath extension towards document and graph data; not implemented yet | |
JSONPath | XML, JSON | A mimic of XPath for both XML and JSON; inspired the SQL/XML standard | |
JSONiq | XML, JSON | A mimic of XQuery for both XML and JSON; Example: the Zorba XQuery processor | |
Graph Extensions | XSPARQL | RDF, XML | A combination of SPARQL and XQuery; Example: Marklogic’s XSPARQL |
AgensQL | Graph, relation | A combination of Cypher and SQL; Example: AgensGraph’s query language | |
AgeQL | Graph, relation | A combination of SQL and openCypher; Example: PostgreSQL AGE’s query language | |
Native | AQL | KV, JSON, property graph, GeoSpatial | A native multi-model query language; ArangoDB’s query language |
OrientQL | Graph, JSON, KV | A native multi-model query language; OrientDB’s query language | |
KQL | Relation, Time series, GeoSpatial | A combination of SQL and Gremlin; Example: OrientDB’s query language |
2 Preliminaries
2.1 Multi-model data
:Person
, :Post
, :Tag
, and :follows
) and associates with a set of key/value pairs as its properties. Nodes of other types, i.e., :Post
, :Tag
, are not presented in the graph. The JSON document records the order information, while invoices are kept in the XML document so that they can be presented to the customers in various styles. Customer feedback on the products is kept in key/value pairs (with a pair of keys, i.e., custID and productID). Finally, we use a relational table to record the information of customers.
2.2 Multi-model queries
SELECT-FROM-WHERE
queries in SQL in which the WHERE
clause uses exclusively conjunctions of atomic equality conditions [6]. However, we cannot express the transitive closure and aggregate with CQ (likewise with the relational algebra and calculus) [37]. Normally, one can extend the expressive power of CQ in several ways: (1) adding more operations such as union and negation; (2) allowing recursion, which is essential for extending CQ to graph data; (3) counting, which is essential for aggregate queries; (4) introducing more powerful grammars (e.g., regular expression or context-free grammar) for formulating complex query expressions.2.3 Cross-model query processing
Category | DBMSs | Query languages | Supported models | Description |
---|---|---|---|---|
Relation-based Extensions | AsterixDB | SQL++ | Relation, JSON | Extend SQL with path expressions |
Oracle | PL/SQL | Relation, JSON, XML, RDF | Extended from SQL | |
MySQL | Oracle MySQL | Relation, KV | Extended from SQL | |
PostgreSQL | PL/pgSQL | Relation, JSON, XML, KV | Extended from SQL, Turing-complete | |
SQL Server | T-SQL | Relation, JSON, XML | Extend SQL to support JSON and XML, Turing-complete | |
IBM Db2 | SQL | Relation, XML | Extend SQL to support trees, not Turing-complete | |
Document-based Extensions | MarkLogic | XQuery/XPath | XML, JSON, RDF, Relation | Extended from XQuery, Turing-complete |
Zorba | XQuery/JSONiq | XML, JSON | Extended from XQuery | |
Graph-based Extension | OrientDB | OrientQL | Graph, JSON, KV | Extended from graph model, not Turing-complete |
AgensGraph | AgensQL | Graph, Relation | Cypher on SQL or SQL on Cypher, not Turing-complete | |
Aphache AGE | openCypher+SQL | Graph, Relation | a graph extension of PostgreSQL to use SQL along with openCypher | |
Native | ArangoDB | AQL | JSON, graph, KV | SQL-like query language |
OrientDB | OrientQL | Graph, JSON, KV | SQL-like | |
Kusto | KQL | Relation, time series, GeoSpatial | SQL-like |
3 Relational extensions
3.1 Relational query languages and the SQL standards
SELECT-FROM-WHERE
(SFW) clauses, and we say a query language is SQL-like if it has similar clauses as well. By 1986, SQL had been formally adopted by the ANSI and ISO as a standard database query language, namely SQL-86 (also known as SQL1). The earliest versions of SQL lacked support for some aspects of the relational model, including subqueries, primary keys and referential integrity. All these problems were solved by the SQL:1992 standard (or SQL2) [98]. Procedural extensions SQL/PSM [99] formally became a part of the SQL:1996. The common table expressions (CTEs), WITH [RECURSIVE ]
construct, was introduced into SQL:1999 (also called SQL3) to allow recursive queries [100, 101].3.2 SQL extensions toward semistructured data
XML
and JSON
into the SQL standard, where each data type associates with a set of constructors, functions, and rules for XML(JSON)-to-SQL mapping. Even though XML and JSON are somewhat similar—documents with nested structures—their integration into SQL is quite different. The most striking difference is that the standard does not define a native JSON type like it does for XML. Instead, the standard uses strings to store JSON data. Other than creating a new data type, SQL++ builds a unified data model to capture and manipulate both relational data and JSON documents.3.2.1 SQL/XML and SQL/JSON
Oracle 11g | IBM DB2 9.7 | MS SQL Server 2008 | MySQL 5.1.30 | PostgreSQL 9.1 | |
---|---|---|---|---|---|
XML Datatype | XMLType | XML | XML | CLOB | XML |
XPath/XQuery Syntax | Full | Full | Full | XPath | Full |
Predicates | Full | Full | Partial | No | Full |
Functions | Full | Full | Partial | Partial | Ful |
XML
data types like SQL Server or PostgreSQL but just takes XML as a CLOB (Character Large Object) data type. Oracle entitles the XMLType
’ instead of ‘XML
’. (2) Support for XPath/XQuery. In order to retrieve XML data within a SQL query, one must incorporate the XPath expressions or XQuery’s FLWOR
clauses into the standard SQL syntax. Excepting MySQL supports only XPath, other RDBMSs provide full supports for XPath and XQuery. (3) Special functions for handling XML data. The standard defines 9 functions that allow application to retrieve XML directly within SQL queries, including XMLAGG
, XMLCONCAT
, XMLCOMMENT
, XMLELEMENT
, XMLFOREST
, XMLPARSE
, XMLPI
, XMLROOT
, and XMLSERIALIZE
. Oracle, DB2, and PostgreSQL provide full support for all the XML functions. MySQL provide two functions, ExtractValue()
and UpdateXML()
, to work with XPath. (4) Rules for mapping XML to SQL tables. The standard defines several rules for mapping SQL types to XML or in a reversed mapping. For example, mapping an SQL table to XML and an XML Schema document.3.2.2 SQL++
person.name
”, person
is a variable that evaluates to an object, and the identifier name
is used to find a field in the object whose name matches that identifier. Since the result of a path expression is an object, it can serve as one step in a longer path expression such as “person.name.first
”. Since SQL++ supports boolean operators such as NOT
, AND
, and OR
, we can issue more complex query patterns by binding several path expressions with such operators. Unfortunately, SQL++ doesn’t support complex queries like RPQs, CRPQs, and 2CRPQs. Figure 4 shows a cross-model query of SQL++ accessing both the relational table and JSON document. The query is a composite of the relational query and path query. For more examples of SQL++, please refer to the technical papers [108, 109] and D. Chamberlin’s tutorial [109].3.3 SQL extensions toward graph data
3.3.1 SQL extension for RDF
SPARQL
”. The subquery draws only from relations represented as RDF statements and returns data in tabular form to an SQL interface. Virtuoso has a universal server, which is a middleware that combines the functionalities of a traditional RDBMS, an RDF engine, an XML engine, a web server, and even a file server in a single system. While processing a SPASQL query, the universal server will incur an RDF function call to the RDF engine for handling the SPARQL subquery and the result is then returned to the host SQL engine for further processing. In this way, one can retrieve both relational and RDF data with a SPASQL query like Query 3.
3.3.2 SQL extension for property graph
SELECT-FROM-WHERE
clauses, and a query is represented as a tree of clauses, each introduced by a distinct keyword and optionally further qualified by sub-clauses, expressions, and patterns. According to the working draft [105] of the GQL specification in 2019, GQL reuses basic literal and expression syntax from SQL but is amended with the following features.SELECT-(MATCH)-FROM-WHERE-(RETURN)
clauses in which the MATCH
and RETURN
clauses are optional when querying relational tables. Therefore, GQL can provide full compliance with the SQL standard. Given a pattern MATCH (p:Person)-[:lives
_in
]->(c:City)
, the query result returned by the SELECT
clause is represented in a tabular fashion. We have to use the optional RETURN
statement if we need to construct a graph as output. The syntax of GQL’s RETURN
clause is given in Query 4. GQL follows the definition in Cypher and also provides multiple semantics for graph pattern matching [121, 122], i.e., isomorphism and homomorphism. In addition, GQL borrows the syntax of regular path expressions from PGQL [66], i.e., “(a:Person)-/:friend
_of*/-
>(a:Person)
”.
4 Document extensions
4.1 XPath/XQuery extensions
4.1.1 Core XPath/XQuery
/axis::node
_label[predicates]
, has three components: (i) an axis, (ii) a node test, and (iii) zero or more predicate. XPath supports a number of axes: ancestor, ancestor-or-self, attribute, child, descendant, descendant-or-self, following, following-sibling, parent, preceding, preceding-sibling, and self. All nodes in an XML document conform to a document order in which the data is represented as a hierarchy tree and a path query is evaluated in the order with respect to a context node. The axes can be divided into a forward axis and a reverse axis according to the direction from the context node. For example, child
or descendant
specifies the two directions for XML navigation respectively. A node test within an XPath query is to retrieve nodes and the predicates after the node test is to filter a sequence of values. Apart from using the name of a node or a wildcard (to select unknown nodes), we can also use other node tests such as node()
and text()
.4.1.2 Document navigation and the core-XPath fragment
A/B/C
). Query 5 presents 5 simple path queries of XPath. P1 lists the descriptions of all items offered for sale by Smith. P2 has the same purpose as P1 but uses a different axis. P3 defines the path P1 as a variable description
. P4 reuses the variable in P3 to list the status attribute of the item that is the parent of a given description. P5 is to list all the items that are the parents of a given description but excluding the @status
elements. We can use predicates in the steps of a path query to filter a sequence of values. For example, in the step item[seller = “Smith”]
, the predicate (seller = “Smith”
) is used to select certain item nodes and discard others. We will refer to the items in the sequence being filtered by a predicate as candidate items. The variable description
given in P3 is used to bound a node in the document, and traverses will start at the nodes bound to the variable. Path expressions can be also written in abbreviated syntax. Within a path expression, a single dot (.
) refers to the context node (self-axis), and two consecutive dots (..) refer to the parent of the context node. These notations are abbreviated invocations of the child (/
), descendant (//
), and attribute axes (@
), respectively.
descendant::p
(abbreviated as .//p
) denotes in any XML tree T, the set of all pairs (m, n) with n a descendant node of m that has tag name p. Thus the binary relation is equivalent to a conjunctive query: \(\phi (x, y)={\text {descendant}}(x, y) \wedge \textrm{p}(y).\)descendant::p(1)/.../descendant::p(n)/descendant::q
, for all permutations of \(1 \ldots n\). Therefore, conjunctive path queries and 2-way conjunctive path queries can be naturally supported by all versions of XPath.MMDBs | Data models | Query languages |
---|---|---|
Axis | := | self | child | parent | right | left | descendant |
:= | | ancestor | following | preceding | |
:= | | following sibling | preceding sibling | |
NameTest | := | QName | * |
Step | := | Axis::NameTest |
PathExpr | := | Step |
:= | | PathExpr/PathExpr | |
:= | | PathExpr union PathExpr | |
:= | | PathExpr[NodeExpr] | |
:= | PathExpr | |
:= | | not NodeExpr | |
:= | | NodeExpr and NodeExpr | |
:= | | NodeExpr or NodeExpr |
(child::p)*/child::q
, to list all child items or descendant items named q
, where (path)*
denotes the reflexive transitive closure of the binary relation denoted by path
. RPQ can be viewed as a mix between Core XPath and regular path expressions [16]: it has the filter expressions of the former and the Kleene star of the latter. With the extended operators, one can also easily issue CRPQ and 2CRPQ queries.relOp
is one of \(=\), <, >, \(\le \), \(\ge \), \(\ne \). Twig queries can be seen as an abstraction of a core fragment of XPath and XQuery [139]. Therefore, a substantial amount of work on XML query evaluation and optimization [70, 140, 141] using tree patterns as a basis.4.1.3 Graph pattern matching with XPath/XQuery
4.1.4 Querying relational data with XPath/XQuery
4.2 JSON-oriented extensions
4.2.1 Document navigation with JSONPath
.store.book[0].title
” and the bracket-notation “[‘store’][‘book’][0][‘title’]
” to formulate path queries as well. Internal or output paths will always be converted to the more general bracket notation. JSONPath allows the wildcard symbol * for member names and array indices. It borrows the descendant operator ‘.’ from E4X and the array slice syntax proposal [start:end:step]
from ECMASCRIPT 4. Expressions of the underlying scripting language (<expr
>) can be used as an alternative to explicit names or indices as in “.store.book[(@.length-1)].title
”. Filter expressions are supported via the syntax ?(<boolean expr>) as in “.store.book[?(@.price
< 10)].title”
.4.2.2 Querying document data with JSONiq
FLWOR
construct, the language’s functional aspect, the semantics of comparisons in the face of data heterogeneity, and declarative, snapshot-based updates. For example, Query 8 is used to calculate the average score for each question in a test.
5 Graph extensions
5.1 Graph query languages
:follows
relationship (of length 2) and one person in each pair had commented on the other’s post. PGQL pioneered support for full regular path expressions [66]. Cypher was influenced by XPath [40] and SPARQL [43, 59], but it only supports a restricted form of RPQs: the concatenation and disjunction of single relationship types, as well as variable length paths (i.e., transitive closure). For example, (a:Person)-[:knows*2]-
>(b:Person)
describes paths of fixed length of 2, (a:Person)-[:knows*3..5]-
>(b:Person)
represents paths have variable lengths from 3 to 5, and (a:Person)-[:knows*]-
>(b:Personb)
stands for paths of any length. GSQL and G-CORE combine the ASCII-art syntax from Cypher and the regular path expression syntax from PGQL. With the PGQL’s regular path expression, we can express RPQ and CRPQ queries. By using the boolean operators such as conjunction (\( A \& B\)), disjunction (A|B), negation (!A), and grouping/nesting (\( (A \& B)|C\)), we can express even more complex graph queries. In addition, we can issue 2CRPQs by removing the direction of relationships, e.g., (a)-[:knows]-(b)
implies a
and b
know each other.Language | Core query syntax and Example query |
---|---|
PGQL | SELECT-FROM-MATCH-WHERE |
SELECT a.name, b.name, x.id | |
FROM social_network | |
MATCH (a:Person)-/:follows(1,2)/->(b:Person) | |
WHERE (a)-[:hasPost]->(x:Post)<-[:comment]-(b) | |
Cypher | FROM-MATCH-WHERE-RETURN |
FROM social_network | |
MATCH (a:Person)-[:follows*1..2]->(b:Person) | |
WHERE (a)-[:hasPost]->(x:Post)<-[:comment]-(b) | |
RETURN a.name, b.name, x.id | |
G-CORE | SELECT-MATCH-ON-WHERE |
SELECT a.name, b.name | |
MATCH (a:Person)-[:follows*1..2]->(b:Person) ON social_network | |
WHERE EXISTS ( | |
CONSTRUCT () | |
MATCH (a)-[:hasPost]->(x:Post)<-[:comment]-(b) ) | |
GSQL | SELECT-FROM-WHERE |
SELECT a.name, b.age, x.id | |
FROM social_network | |
WHERE (a:Person) -/:follows*1..2/-> (b:Person) | |
(a) -[:hasPost]-> (x:Post) | |
(b) -[:comment]-> (x) | |
ORDER BY a.name |
5.2 SPARQL extensions
Multi-model data | Query language | Functionality |
---|---|---|
RDF+Relational | SPARQL+SQL | SQL procedures and aggregation |
RDF+Text | SPARQL+Full-Text Search | Keyword search over RDF triples |
RDF+XML | SPARQL+XQuery | FLWOR expressions on RDF |
RDF+Geo-Spatial | SPARQL+Geo-Spatial | Operations on geometric objects |
Supported data types | Extended functions |
---|---|
Point, Multipoint, LineString, MultiLineString, Polygon, MultiPolygon, GeometryCollection | equals, disjoint, intersects, touches, within, contains, overlaps, crosses, nearby, distance |
5.3 Cypher extensions
Data models | Language | Functionality |
---|---|---|
Property graph + Relational | Cypher + SQL | SQL-in-Cypher, Cypher-in-SQL |
Property graph + ML models | Cypher + UDF/Procedure | Regression, Classification, Graph embeddings |
6 Native multi-model query languages
Query language | Supported data models | Query syntax |
---|---|---|
AQL (ArangoDB) | Key-Value, JSON, Graph, GeoSpatial | FFR (For, Filter, Return) |
OrientQL (OrientDB) | Key-Value, JSON, Graph | SQL-like syntax with dot(. ) |
KQL (Kusto) | Relational, Time series, GeoSpatial | Dataflow operators with pipe( | ) |
6.1 ArangoDB query language
Data models | Syntax/Operator |
---|---|
Key-Value | Return (@collection, @key) |
JSON | FOR @doc IN @collection FILTER Boolean_function(@doc) RETURN @doc |
Graph | FOR @vertex,@edge,@path IN @min..@max OUTBOUND|INBOUND|ANY @startVertex GRAPH @graph RETURN {@vertex, @edge, @path} |
GeoSpatial | GEO_LINESTRING(), GEO_MULTILINESTRING(), GEO_MULTIPOINT(), GEO_POINT(), GEO_POLYGON(), GEO_MULTIPOLYGON(),DISTANCE(), GEO_CONTAINS(), GEO_DISTANCE(), GEO_AREA(),GEO_EQUALS(), GEO_INTERSECTS(), IS_IN_POLYGON() |
FOR-FILTER-RETURN
(FFR) expressions where the FOR operation is used for data iteration, FILTER for data filtering and joining, and RETURN for data projection and returning. In particular, the results can be returned as JSON or table, or be visualized as graphs. In addition to advanced array operations, AQL also provides the syntax of graph traversal for named graphs and edge collections using the FFR expressions. Such a feature allows for navigational path queries, pattern matching queries, and shortest path queries expressed in AQL. Interestingly, AQL can also specify a single statement to retrieve and combine data from JSON, key-value, and graph. Query 18 shows an example of AQL query, in which we can naturally handle cross-model queries.
name=“Nike”
, and the feedback should have 5-star ratings. Specifically, line 1–3 operates on the graph, JSON, and key-value data, respectively. In the graph traversal, the min and max length is 1 and 3, the direction is outbound from the starting person with Id 1 in the “knowsgraph” which is an edge collection. The filtering conditions and equijoin predicates are specified in the Filter clause accordingly during lines 4–6. In line 4, the JSON orders are joined with persons in the graph on Ids. line 5, it uses the [*] operator to access to the brand element in items array. In line 6, the persons in the graph are joined with feedback that has a 5-score rating. Finally, the filtered persons and feedback are returned as a JSON array.6.2 OrientDB query language
Data models | Syntax/Operator |
---|---|
Key-Value | SELECT * FROM #rid |
JSON | SELECT [ <Projections> ] [ FROM <Target> ] [ WHERE <Condition> ] [ GROUP BY <Field> ][ ORDER BY <Fields> ] [ UNWIND <Field>] |
Graph | TRAVERSE < relationship> [FROM <target>] [MAXDEPTH <number> | WHILE < condition>] [STRATEGY <strategy>] |
Geo-Spatial | Point(), Line(), Polygon(), MultiPoint(), MultiLine(), MultiPolygon(), ST_AsText(), ST_GeomFromText(), ST_AsGeoJSON(), ST_GeomFromGeoJSON(), ST_Within(), ST_Contains(), ST_Disjoint(), ST_Intersects(), ST_AsBinary() |
SELECT-FROM-WHERE
structure. However, it does not completely follow the standard SQL syntax. For instance, the JOIN syntax is not supported while relationships are represented by links. Thus, it uses the dot (.) notation to navigate links or embedded documents. For the graph traversal, it provides a TRAVERSE command that recursively navigates the graph in either depth-first or breadth-first search. Query 19 illustrates an implementation of UniBench Q5 in OrientQL’s syntax.
6.3 Kusto query language (KQL)
Data models | Syntax/Operator |
---|---|
Relational | Source Table 1 | [ Tabular operators* ] [ union | join ] [Source Table n | [ Tabular operators* ][ union | join]]* |
Time series | make-series, avg(), count(), max(), min(), percentile(), stdev(), sum(), variance(), series_fir(), series_iir(), series_fit_line(), series_outliers(), series_periods_detect(), series_stats(), series_fill_backward(), series_fill_const(), series_fill_forward(),series_fill_linear() |
Geo-Spatial | geo_distance_2points(),geo_distance_point_to_line(),geo_line_densify(), geo_point_in_circle(),geo_point_in_polygon(),geo_point_to_geohash(), geo_point_to_s2cell(),geo_polygon_to_s2cells(),geo_polygon_densify() |
distance
_1
_to
_100 m
to compute the shortest distance between two geospatial coordinates. Line 3 gives the range of the distance that should be between 1 and 100 m. Line 4 contains another filtering condition in the where operator that requires the events should take place between 2007-11-01 and 2007-12-01. Finally, the last line projects the results on three columns: distance_1_to_100m, State, and EventType. In addition to the single-source analysis, KQL supports cross-database and cross-cluster queries for complex data analysis.