1 Introduction
2 Preliminaries
ex:iri1
, ex:iri3
, rdfs:label
, and ex:Sex
are examples of IRIs in \(\mathcal {I}\), the first two with the role of nodes and the second two with the role of predicates. On the other hand, "11M"
and 2018
are literals in \(\mathcal {L}\). Finally, (ex:iri1
, ex:Sex
, ex:iri3
) is an (s, p, o) triple.ex:childOf
predicates. These patterns are called property paths. They are defined via a specialized form of expressions called path expressions (similar to regular expressions) and offer a succinct way to write parts of basic graph patterns and also extend matching of triple patterns to arbitrary length paths [35].SELECT
\(\mathbf {V}\) WHERE
P”, with \(P=\{t_1,\ldots \,,t_n\}\) being a set of triple patterns (BGP). Optionally, one or more FILTER
clauses further constrain the variables in P. Let \(\mathcal {X}_{P}\) denote the finite set of variables occurring in P, i.e., \(\mathcal {X}_{P}\subset \mathcal {X}\) [63], then \(\mathbf {V}\) is the vector of variables returned by the query such that \(\mathbf {V}\subseteq \mathcal {X}_{P}\) . Additional operators such as UNION
or OPTIONAL
allow more than one BGP in a query by defining the non-conjunctive semantics of their combination. Finally, SPARQL queries can also make use of GROUP BY
and aggregate operators.3 Access patterns
-
Constants the presence of constant values (as opposed to variables which have bindings).
-
Filter The presence (or absence) of a filter on a range of values.
-
Traversal The complexity and type of the traversal described by different triple patterns of the BGP.
-
Pivot How different triple patterns of the BGP are linked together by some common atom in the BGP (here called a pivot).
-
Return The information expected to be returned.
-
Write Whether and how the BGP causes a change in the contents of the database.
isLiteral
, isBlank
, and isURI
. Similarly, RDF literal strings can be annotated with language tags allowing a query to filter based on those tags. For instance, to distinguish that the string “Rome” is in French and not in English, RDF represents it as "Rome"@fr
instead of "Rome"@en
. Hence, a query could select only strings marked as @fr
.DISTINCT
keyword is used, only the set of distinct bindings for each variable is required. This case is named Values (distinct). Moreover, some queries could require the bindings for some variable to be returned in some predefined order. We refer to this case as Values (sorted). For some queries, it is sufficient to identify whether a variable binding exists satisfying the pattern, but we are not required to return the binding. In some other cases, a query just needs to verify whether a specific path exists or vice versa; when a specific connection does not exist, this is also addressed by a negation on an existence check. SPARQL ASK
queries are an extreme example where the entire query returns no variable bindings but only the value true (exists) or false (does not exists) for a specific BGP. Finally, a query could be required to return aggregated values, i.e., the count of matchings for a BGP or max and min values for literals bindings.wdt:P31/
wdt:P279*
wd:Q1190554
is of the form p1 / p2* o
, meaning that it would match two alternative paths: (1) 1-hop traversals over p1 = wdt:P31
reaching the target node o=wd:Q1190554
directly, and (2) *-hop traversals starting with one edge for wdt:P31
and reaching the object through a sequence of arbitrarily long paths matching the p2 = wdt:P279
triple pattern. Hence, the query contains two 1-hop traversals (marked with stars, one in each direction, since wdt:P279*
is optional) and a composite property path (wdt:P31/wdt:P279*
, see Definition 3). It contains a triple pattern with constants in both (p) and (o) (highlighted in dashed blue) and two triple patterns with a constant only in (p) (in solid blue). Moreover, the query contains both a closed range filter and two different special filters: one for the DATATYPE
and one for the LANG
property of literals. Finally, the ?event
variable is a 3-way pivot, which can also be executed as a set of binary \(s\equiv s\) pivot access patterns.
4 A design space for RDF data representations
4.1 Cost model
4.2 Data representation compatibility
isLiteral
), then any triple with a IRI as object is unneeded.-
seek compatible with \(\mathcal {A}\) if \(C^{r}_{1}(\mathcal {A}|\mathcal {D})=\varepsilon \), where \(\varepsilon {\in }\mathbb {R}^{+}\) is some small system-dependent constant independent of \(|{\mathcal {G}}|\) (e.g., if \(\mathcal {D}\) is a hash-table the fixed cost to search elements in \(\mathcal {D}\) is usually approximated to the constant 1.2).
-
sequence compatible with \(\mathcal {A}\) if \(\mathcal {S}\) can be sequentially retrieved after the initial seek, that is \(C^{r}_{\varOmega }(\mathcal {A}|\mathcal {D})=0\) and \(C^{s}_{\varOmega }(\mathcal {A}|\mathcal {D})\le \mathrm{max}(|G|,|\mathcal {S}|)\).
-
selection compatible with \(\mathcal {A}\) if no excess results are retrieved, that is \(C^{r}_{\varOmega }(\mathcal {A}|\mathcal {D})+C^{s}_{\varOmega }(\mathcal {A}|\mathcal {D})\le |\mathcal {S}|\).
4.3 The design space dimensions
5 Data representations in RDF triplestores
5.1 Subdivision
5.2 Compression
5.3 Redundancy
s p* o
(i.e., property paths with kleene-start patterns, see Definition 3). Only one system (gStore [88]) offers a data representation tuned to matching an N-way same-position pivot (a star-shaped subgraph structure). gStore is also notable for being the only system employing a Bloomier filter construct [19] (the VS*-tree) as a secondary data representation. Bloom filters allow determining the existence of a data item using a compact representation with low latency at the expense of false positives but never false negatives. Bloomier filters [19] extend this capability to allow several functions, in this case returning the vertices matching a set of PO/SP patterns. The use of a PO/PS hash map as part of the VS*-tree lookup process is an almost singular example of using two-position combination hash maps (e.g., sp\(\rightarrow \)o). Notably, a recent extension of the Jena (TDB) system to support a worst-case-optimal join algorithm [38] has added three additional B+-tree indexes to support this pattern. This results in a high level of redundancy of the data, which new approaches are trying to overcome with more compressed (and way less redundant) indexed representations [8]. Also, Trident [76] employs the replication of the six permutations of the SPO positions, but employs adaptive storage representation to reduce the memory footprint.Subdivision | Compression | Redundancy | |
---|---|---|---|
Prevalent approaches | The increasing use of hash-maps rather than B+-trees. | Encoding of literals as integers. Triple store compression using bit-based representations. | Multiple (\(\ge 3\)) representations per system. |
Gaps | Limited use of complex and hybrid indexes. Limited support for representations for optimal N-way joins. | No compression of secondary indexes. | No support for special filters (e.g., language). No use of bloom-filters for existence checks. Limited support for complex traversals such as k-hops and reachability. No support for two-position (e.g., sp\(\rightarrow \)o) hash maps. |
5.4 Summary
6 Design space analysis
Fully instantiated | Partially instantiated | Uninstantiated | |
---|---|---|---|
Subdivision | \(\uparrow \) smaller search space | \(\uparrow \) smaller search space if compatible with query; \(\downarrow \) more search steps if incompatible | \(\downarrow \) more search steps |
Compression | \(\downarrow \) decompression required to search exact values; \(\downarrow \) intermediate resultset is small | \(\downarrow \) decompression required to search exact values; \(\uparrow \) compressing intermediate results | \(\uparrow \) compressing intermediate results |
Redundancy | \(\uparrow \) A Bloom filter to check existence | \(\uparrow \) different SPO tree or hash indexes | \(\sim \) requires full scan of data |
6.1 Matching access patterns to the design space
Enumerable range | Large range | No-range | |
---|---|---|---|
Subdivision | \(\uparrow \) faster search of single values if compatible; \(\downarrow \) more search steps if incompatible | \(\uparrow \) faster search of single values if compatible; \(\downarrow \) more search steps if incompatible | \(\downarrow \) more search steps |
Compression | \(\downarrow \) decompression to access small set; \(\uparrow \) compressing intermediate results | \(\uparrow \) faster read; \(\uparrow \) compressing intermediate results | \(\uparrow \) compressing intermediate results; \(\downarrow \) decompression required for search |
Redundancy | \(\uparrow \) hash-table search | \(\uparrow \) B+Tree search& sorted scan | \(\sim \) requires full scan |
1-hop | k-hops over p | */+ -hops over p | |
---|---|---|---|
Subdivision | \(\uparrow \) property-table or clustered on S or SP, faster search/retrieval | \(\uparrow \) clustered on P or PS, faster search/retrieval; \(\downarrow \) more search steps if divided over O | \(\uparrow \) clustered on P or PS, faster search/retrieval; \(\downarrow \) more search steps if cluster is incompatible |
Compression | \(\downarrow \) requires decompression; \(\uparrow \) compressing intermediate results | \(\downarrow \) requires decompression; \(\uparrow \) compressing intermediate results; \(\uparrow \) bitwise set intersection | \(\downarrow \) requires decompression; \(\uparrow \) compressing intermediate results; \(\uparrow \uparrow \) bitwise set intersection |
Redundancy | \(\uparrow \) hash-index on S or SP, faster search/retrieval |
Two-way same S/O/P | Two-way different S/O/P | N-way pivot | |
---|---|---|---|
Subdivision | \(\uparrow \) clustered-properties [17] less retrieval steps | \(\uparrow \) restrict search space | |
Compression | \(\downarrow \) requires decompression; \(\uparrow \) compressing intermediate results; \(\uparrow \) bitwise set intersection | ||
Redundancy | \(\uparrow \) hash-index fast search | \(\uparrow \) Hash-index fast search | \(\uparrow \) B+-tree WCOJ [38] guarantees |
All (unsorted) | All (sorted) | Distinct | Existence | Aggregate | |
---|---|---|---|---|---|
Subdivision | \(\downarrow \) more search steps if divided results; \(\uparrow \) better performance for sorted retrieval | \(\downarrow \) more search steps if divided results; \(\uparrow \) small search space if compatible. | \(\uparrow \) favorable to group-by operations | ||
Compression | \(\uparrow \) compressing intermediate results; \(\downarrow \) decompression final results | \(\uparrow \) fast skip over repeated values | \(\downarrow \) decompression required to check existence | \(\downarrow \) decompression required to compute aggregate | |
Redundancy | \(\sim \) all results are returned unprocessed | \(\uparrow \) avoid sorting over sorted representation | \(\uparrow \) in a key-value data structure access only keys | \(\uparrow \) Bloom filter search | \(\uparrow \) precomputed values |
Insert | Delete | |
---|---|---|
Subdivision | \(\uparrow \) localized updates; \(\uparrow \) smaller locks; \(\downarrow \) resizing of single subdivision; \(\downarrow \) re-distribute subdivisions | \(\uparrow \) localized updates; \(\uparrow \) smaller locks; \(\downarrow \) re-distribute subdivisions |
Compression | \(\downarrow \) requires decompression; \(\uparrow \) insertion does not require resizing; | \(\downarrow \) requires decompression; |
Redundancy | \(\downarrow \) multiple updates are required |
Access pattern | Best found | Incompatible | Unexplored |
---|---|---|---|
\(S,\,P,\,O\) 1 Constant | Hash table(+sL, +sK, +sQ), Separate Table(+sL, +sQ) | Incompatible Hash table(-sL,-sK,-sQ) | Surf(+sL, +sK, +sQ), S3(+sL, +sQ) |
\(SP,\,SO,\,PO\) 2 Constants | Hash Table(+sK, +sQ), B+Tree(+sL, +sQ) | SuRF(+sL, +sK, +sQ), \(so\,{\mapsto }p\), \(sp\,{\mapsto }o\)(+sL, +sK, +sQ), S3(+sL, +sQ) | |
SPO 3 Constants | Hash Table(+sK, +sQ), B+Tree(+sL, +sQ) | Bloom filter(+sL, +sK) | |
\( < \& >\) Closed range | B+Tree(+sL, +sQ) | Mixed Literals/URI(-sL, -sQ) | SuRF(+sL, +sK, +sQ), Skip list(+sL, +sQ) |
\(<|>\) Open range | B+Tree(+sL, +sQ) | Mixed Literals/URI(-sL, -sQ) | SuRF(+sL, +sK, +sQ), Skip list(+sL, +sQ) |
\(\mathbb {T}\) Special type range | Partial subdivision of ID ranges(+sL, +sQ) | representations mix these(-sL,-sK,-sQ) | Separate Tables/Files(+sL, +sQ) |
\(s{\rightarrow }o\) 1-hop over p | Hash Table(+sK, +sQ), Matrix(+sK, +sQ), Separate Table(+sL, +sQ) | \(sp\,{\mapsto }o\)(+sL, +sK, +sQ) | |
\(s\xrightarrow {\text {k}}o\) k-hop \(p_1,\dots ,p_k\) | Separate p matrix(+sL), Reachability Index(+sL, +sQ), VS*-tree(+sL) | \(sp\,{\mapsto }o\)(+sL, +sK) | |
\(p^{*/+}\) ?-hop over p | Separate p matrix(+sL), Reachability Index(+sL) | ||
2-way same position | Single Hash table(+sK, +sQ) | B+Tree(-sK, +sL) | \(sp\,{\mapsto }o\)(+sL, +sK, +sQ) |
2-way different position | Pre-joined materialization(+sL, +sQ); Reachability Index(+sL, +sQ) | B+Tree(-sK, -sL, -sQ) | 4-position hash-map(+sK, +sL, +sQ) |
N-way same position | VS*-tree (+sK); Indexed Property table(+sL,+sQ); Multiple B+Tree(-sK, -sL, -sQ) | Incompatible Hash table(-sL,-sK,-sQ) | Dyadic tree [43] (+sL) |
N-way arbitrary position | Multiple B+Tree(-sK, -sL, -sQ) | Incompatible Hash table(-sL,-sK,-sQ) | Dyadic tree [43] (+sL) |
All values | Single file(+sQ) | ||
Sorted values | B+tree(+sL, +sQ); Sorted table(+sL, +sQ) | Hash Table (-sL, -sQ) | |
Distinct values | Hash table(+sK, +sL) | Value counts(+sL, +sK, +sQ) | |
Existence | Hash table(+sK) | Bloom filter (+sK, +sL) | |
Aggregate | Store counts(+sL, +sK, +sQ), group-by clusters(+sL, +sK, +sQ) | Compressed Value counts(+sL, +sK, +sQ) |
6.2 A compatibility-based analysis
SELECT ?s COUNT(?o)
WHERE { ?s ex:friendOf ?o } GROUP BY ?s
. For this case, we did not find systems that have a compressed representation of \(\langle \)key;count\(\rangle \) and a seek compatible subdivision over predicates.7 Workload case analysis
7.1 Analyzed workloads
7.2 Limitations
ORDER BY
clause. Also, we excluded from this analysis those ORDER BY
clauses that followed an aggregation resulting from a GROUP BY
clause, since the sorted output could not be directly derived from a sorted retrieval operation.7.3 Results and discussion
@en
), or specific types of literals, i.e., numeric values vs. strings vs. dates. This is especially prevalent when querying multi-language knowledge graphs and when filtering values only for literals of a specific data type. This calls for indexes or subdivisions of triple objects by languages and data types. Yet, none of the data representations in the reviewed storage systems explicitly support such access patterns although the benchmark queries suggest they could be especially useful. Conversely, the relative absence of closed/open range filters from most benchmarks corresponds well with the decreased reliance of systems on B+-trees observed in Section 5.3 as the relative advantage of B+trees over hash-based lookup mechanisms is greatly diminished in the absence of closed/open range queries or other access patterns that require sorted access.