SPARQL is the standard query language for RDF data. The distinctive feature of SPARQL is the OPTIONAL operator, which allows for partial answers when complete answers are not available due to lack of information. However, optional matching is computationally expensive—query answering is PSPACE-complete. The well-designed fragment of SPARQL achieves much better computational properties by restricting the use of optional matching—query answering becomes coNP-complete. On the downside, well-designed SPARQL captures far from all real-life queries—in fact, only about half of the queries over DBpedia that use OPTIONAL are well-designed. In the present paper, we study queries outside of well-designed SPARQL. We introduce the class of weakly well-designed queries that subsumes well-designed queries and includes most common meaningful non-well-designed queries: our analysis shows that the new fragment captures over 99% of DBpedia queries with OPTIONAL. At the same time, query answering for weakly well-designed SPARQL remains coNP-complete, and our fragment is in a certain sense maximal for this complexity. We show that the fragment’s expressive power is strictly in-between well-designed and full SPARQL. Finally, we provide an intuitive normal form for weakly well-designed queries and study the complexity of containment and equivalence.
Notes
This article is part of the Topical Collection on Special Issue on Database Theory
1 Introduction
The Resource Description Framework (RDF) [14, 18, 31] is the W3C standard for representing linked data on the Web. RDF models information in terms of labelled graphs consisting of triples of resource identifiers (IRIs). The first and last IRIs in such a triple, called subject and object, represent entity resources, while the middle IRI, called predicate, represents a relation between the two entities.
SPARQL [17, 37] is the default query language for RDF graphs. First standardised in 2008 [37], SPARQL is now recognised as a key technology for the Semantic Web. This is witnessed by a recent adoption of a new version of the standard, SPARQL 1.1 [17], as well as by active development of SPARQL query engines in academia and industry, for instance, as part of the systems AllegroGraph (http://franz.com/agraph/allegrograph/), Apache Jena (http://jena.apache.org), RDF4J (http://rdf4j.org), or OpenLink Virtuoso (http://virtuoso.openlinksw.com).
Advertisement
In recent years, SPARQL has been subject to a substantial amount of theoretical research, based on the foundational work by Pérez et al. [32, 33]. In particular, we now know much about evaluation [1, 3, 4, 6, 8, 20, 22, 23, 25, 28, 34, 38], optimisation [8, 9, 12, 13, 24, 27, 35], federation [10, 11], expressive power [2, 20, 21, 25, 36, 39], and provenance tracking [15, 16] for queries from various fragments and extensions of SPARQL. These studies have had a great impact in the community, in fact influencing the evolution of SPARQL as a standard.
A distinctive feature of SPARQL as compared to SQL is the OPTIONAL operator (abbreviated as OPT in this paper). This operator was introduced to “not reject (solutions) because some part of the query pattern does not match” [37]. For instance, consider the SPARQL query
which retrieves all person IDs from the graph together with their names; names, however, are optional—if the graph does not contain information about the name of a person, the person ID is still retrieved but the variable ?n is left undefined in the answer. For instance, query (1) has two answers over the graph G in Fig. 1a, where the second answer is partial (see Fig. 1b). However, if we extend G with a triple supplying a name for P2, the second answer will include this name.
×
The OPT operator accounts in a natural way for the open world assumption and the fundamental incompleteness of the Web. However, evaluating queries that use OPT is computationally expensive—Pérez et al. [33] showed PSpace-completeness of SPARQL query evaluation, and Schmidt et al. [38] refined this result by proving PSpace-hardness even for queries using no operators besides OPT. This is not surprising given that SPARQL queries are equivalent in expressive power to first-order logic queries, and translations in both directions can be done in polynomial time [2, 25, 36].
This spurred a search for restrictions on the use of OPT that would ensure lower complexity of query evaluation. It was also recognised that queries that are difficult to evaluate are often unintuitive. For instance, they may produce less specified answers (i.e., answers with fewer bound variables) as the graph over which they are evaluated grows larger.
Advertisement
Pérez et al. [33] introduced the well-designed fragment of SPARQL queries by imposing a syntactic restriction on the use of variables in OPT-expressions. Roughly speaking, each variable in the optional (i.e., right) argument of an OPT-expression should either appear in the mandatory (i.e., left) argument or be globally fresh for the query, i.e., appear nowhere outside of the argument. Well-designed queries have lower complexity of query evaluation—the problem is coNP-complete (provided all the variables in the query are selected). Moreover, such queries have a more intuitive behaviour than arbitrary SPARQL queries; in particular, they enjoy the monotonicity property that we observed for query (1): each partial answer over a graph can potentially be extended to undefined variables if the graph is completed with the missing information, and the more information we have the more specified are the answers. Well-designed queries can be efficiently transformed to an intuitive normal form allowing for a transparent graphical representation of queries as trees [27, 35]. Hence, many recent studies concentrate partially [23, 25, 27, 40, 41] or entirely [1, 8, 35] on well-designed queries.
Such a success of well-designed queries may lead to the impression that non-well-designed SPARQL queries are just a useless side effect of the early specification. But is this impression justified by the use of SPARQL in practice? To answer this question, a comprehensive analysis of real-life queries is required. We are aware of two works that analyse the distribution of operators in SPARQL queries asked over DBpedia [7, 34]. Both studies show that OPT is used in a non-negligible amount of practical queries. However, only Picalausa and Vansummeren [34] go further and analyse how many of these queries are well-designed; and the result is quite interesting—well-designed queries make up only about half of all queries with OPT. In other words, well-designed queries are common, but by far not exclusive.
The main goal of this paper is to investigate SPARQL queries beyond the well-designed fragment. We wanted to see if the well-designedness condition could be extended so as to include most practical queries while preserving good computational properties. The main result of our study is very positive—we identified a new fragment of SPARQL queries, called weakly well-designed queries, that covers over 99% of queries over DBpedia and has the same complexity of query evaluation as the well-designed fragment. We also show that our fragment is in a sense maximal for this complexity.
We next describe our results and techniques in more detail. Our first step was to identify typical real-life queries that are not well-designed. We analysed DBpedia query logs in recent USEWOD research datasets [29, 30] and found two interesting types of non-well-designed queries. The first type is exemplified by the following query:
This query is clearly not well-designed because variable ?n, binding the name of a person, appears in two different unrelated optional parts. Let us analyse answers to this query over different graphs. On graph G in Fig. 1a the result is exactly the same as for query (1), shown in Fig. 1b, simply because the IRI v_card:name is not present in G, and so cannot be matched against the second optional part of the query. Similarly, on graph G′ in Fig. 1c, where the source of the name and the name itself are different, the result is as in Fig. 1d. In this case, the first optional part in the query does not match anything in the graph so the variable ?n is left unbound at this point; then the second optional is matched, and the variable is assigned with the name from v_card. More interestingly, query (2) evaluated over the graph G ∪ G′ once again yields the result in Fig. 1b. Indeed, in this case, the first optional part has a match again and ?n is assigned the value Ana; then, this variable is already bound and there is no match for the second optional part that agrees with this value, meaning that the alternative v_card name is disregarded by the query. To summarise, query (2) is once again looking for person IDs and, optionally, their names. Now, however, names are collected from two different sources, foaf and v_card, where the first source is given preference over the second (maybe because it is considered more reliable or more informative, or for some other reason). In other words, if we know the foaf name of a person, it is returned as part of the answer regardless of their v_card name; however, if there is no foaf name, then the v_card name is also acceptable and should be returned; variable ?n is left unbound only if the name cannot be extracted from either source.
Of course, preference patterns encountered in real-life queries are often more complex. Still, we will see that in most cases they do not increase the complexity of query evaluation.
The query uses FILTER, a standard SPARQL operator that admits only answers conforming to a specified constraint. Again, this query is not well-designed because the FILTER constraint mentions the variable ?n, which occurs in the optional part of the query but not in the mandatory part. However, the intention of the query is quite clear: it searches for people whose names are not known to be Ana, including people whose names are unknown.
This use of FILTER is in fact very common in real-life queries. Moreover, it is intuitive as long as FILTER is essentially the outermost operator in the query, as it is in our example. We will see that in all such cases FILTER cannot lead to an increase in complexity.
Having isolated these typical uses of non-well-de-signed-ness, we identify a new fragment of SPARQL that (a) includes all queries of the above two types, (b) subsumes well-designed queries, and (c) has the same complexity of query evaluation as well-designed queries. We call such queries weakly well-designed. They are the maximal fragment without structural restrictions on conjunctive blocks and filter conditions that has the above properties. Our analysis shows that more than 99% of DBpedia queries with OPT are weakly well-designed.
Besides low complexity of query evaluation, we establish a few more useful properties of weakly well-designed queries, which are summarised in the following outline of the paper. After introducing the syntax and semantics of SPARQL in Section 2, we formally define our new fragment in Section 3. In Section 4, we show that, similarly to the well-designed case, weakly well-designed queries can be transformed to an intuitive normal form, which allows for a natural graphical representation as constraint pattern trees. Using this representation, in Section 5, we formally show that the step from well-designed to weakly well-designed queries does not increase complexity of query evaluation; minimal relaxations of weak well-designedness, however, already lead to a complexity jump. In Section 6, we compare the expressive power of our fragment (and its extensions with additional operators) with well-designed queries and unrestricted SPARQL queries; in all cases, we show that the expressivity of weakly well-designed queries lies strictly in-between well-designed and unrestricted queries. In Section 7, we study static analysis problems for weakly well-designed queries and establish \({{\Pi }_{2}^{p}}\)-completeness of equivalence and containment. Finally, in Section 8, we detail our analysis of DBpedia logs.
This article significantly extends the conference paper [19]. Besides providing full proofs of our technical claims, we have extended the analysis section and updated the evaluation to use more recent datasets. Furthermore, we have removed the erroneous claim that queries over unions of weakly well-designed patterns have the same expressive power as unrestricted SPARQL queries; on the contrary, we show that the former are strictly less expressive than the latter.
2 SPARQL Query Language
We begin by formally introducing the syntax and semantics of SPARQL that we adopt in this paper. Our formal setup mostly follows [33], which has some differences from the W3C specification [17, 37]; in particular, we use two-placed OPT and two-valued FILTER (conditional OPT and errors in FILTER evaluation as in the standard are expressible in our formalisation [2, 21]), do not consider blank nodes (their presence in RDF graphs would not change any of our results), and adopt set semantics, leaving multiset answers for future work.
RDF Graphs
An RDF graph is a labelled graph where nodes can also serve as edge labels. Formally, let I be a set of IRIs. Then an RDF triple is a tuple (s, p, o) from I × I × I, where s is called subject, ppredicate, and oobject. An RDF graph is a finite set of RDF triples.
SPARQL Syntax
Let X be an infinite set {?x, ?y, …} of variables, disjoint from I. Filter constraints are conditions of the form
⊤, ?x = u, ?x = ?y, or bound(?x) for ?x, ?y in X and u ∈ I (these constraints are called atomic),
¬R1, R1 ∧ R2, or R1 ∨ R2 for filter constraints R1 and R2.
A basic pattern is a possibly empty set of triples from (I ∪ X) × (I ∪ X) × (I ∪ X) (to avoid notational clutter, in examples we will often omit braces when writing singleton basic patterns, e.g., we will write (?x, u, ?y) instead of {(?x, u, ?y)}). Then, SPARQL (graph) patternsP are defined by the grammar
where B ranges over basic patterns and R over filter constraints. Additionally, we require all filter constraints to be safe, that is, vars(R) ⊆ vars(P) for every pattern (PFILTERR), where vars(S) is the set of all variables in S (which can be a pattern, constraint, etc.) When needed, we distinguish between patterns by their top-level operator; e.g., we write OPT-pattern or FILTER-pattern.
We write \(\mathcal {U}\) for the set of all patterns. We also distinguish the fragment \(\mathcal {P}\) of \(\mathcal {U}\) that consists of all UNION-free patterns, that is, patterns that do not use the UNION operator.
Projection is realised in SPARQL by means of queries with select result form, or queries for short, which are expressions of the form
$$ \mathsf{SELECT} \; X \; \mathsf{WHERE} \; P, $$
(4)
where X is a set of variables and P is a graph pattern. We write \(\mathcal {S}\) for the set of all queries. The set of all triples in basic patterns of a query Q is denoted triples(Q).
Note that every pattern P can be seen as a query of the form (4) where X = vars(P). Hence, all definitions that refer to “queries” implicitly extend to patterns in the obvious way.
SPARQL Semantics
The semantics of graph patterns is defined in terms of mappings, that is, partial functions from variables to IRIs. The domaindom(μ) of a mapping μ is the set of variables on which μ is defined. Two mappings μ1 and μ2 are compatible (written μ1 ∼ μ2) if μ1(?x) = μ2(?x) for all variables ?x ∈ dom(μ1) ∩ dom(μ2). If μ1 ∼ μ2, then μ1 ∪ μ2 constitutes a mapping with domain dom(μ1) ∪ dom(μ2) that coincides with μ1 on dom(μ1) and with μ2 on dom(μ2). Given two sets of mappings Ω1 and Ω2, we define their join, union and difference as follows:
Based on these, the left outer join operation is defined as
×
Given a graph G, the evaluation [[P]]G of a graph pattern P over G is defined as follows:
1.
if B is a basic pattern, then [[B]]G = {μ : vars(B) → I∣μ(B) ⊆ G};
2.
[[(P1ANDP2)]]G = [[P1]]G ⋈ [[P2]]G;
3.
4.
[[(P1UNIONP2)]]G = [[P1]]G ∪ [[P2]]G;
5.
[[(P′FILTERR)]]G = {μ∣μ ∈ [[P′]]G and μ ⊧ R}, where μsatisfies a filter constraint R, denoted by μ ⊧ R, if one of the following holds:
R is ⊤;
R is ?x = u, ?x ∈ dom(μ), and μ(?x) = u;
R is ?x = ?y, {?x, ?y}⊆ dom(μ), and μ(?x) = μ(?y);
R is bound(?x) and ?x ∈ dom(μ);
R is a Boolean combination of filter constraints evaluating to true under the usual interpretation of ¬, ∧, and ∨.
Let μ|X be the projection of a mapping μ to variables X, that is, μ|X(?x) = μ(?x) if ?x ∈ X and μ|X(?x) is undefined if ?x ∉ X. The evaluation [[Q]]G of a query Q of the form (4) is the set of all mappings μ|X such that μ ∈ [[P]]G.
Finally, a solution to a query (or pattern) Q over G is a mapping μ such that μ ∈ [[Q]]G.
3 Weakly Well-Designed Patterns
We begin by recalling the notion of well-designed patterns and then formulate our generalisation. For now, we focus on the fragment \(\mathcal {P}\) of UNION-free patterns (also known as the AND- OPT- FILTER fragment of SPARQL), leaving the operators UNION and SELECT for later sections.
Note that a given pattern can occur more than once within a larger pattern. In what follows we will sometimes need to distinguish between a (sub-)pattern P as a possibly repeated building block of another pattern P′ and its occurrences in P′, that is, unique subtrees in the parse tree. Then, the left (right) argument of an occurrence i is the subtree rooted in the left (right) child of the root of i in the parse tree, and an occurrence i is inside an occurrence j if the root of i is a descendant of the root of j.
Definition 1
(Pérez et al. [33]) A pattern P from \(\mathcal {P}\) is well-designed (or wd-pattern, for short) if for every occurrence i of an OPT-pattern P1OPTP2 in P the variables from vars(P2) ∖ vars(P1) occur in P only inside (the labels of) i.
We write \(\mathcal {P}_{\text {wd}}\) for the fragment of wd-patterns. Such patterns comply with the basic intuition for optional matching in SPARQL: “do not reject (solutions) because some part of the query pattern does not match” [37]; indeed, our canonical use case (1) is clearly well-designed. Evaluation of wd-patterns, that is, checking if μ ∈ [[P]]G for a mapping μ, graph G and pattern \(P\in \mathcal {P}_{\text {wd}}\), is coNP-complete (in combined complexity), as opposed to PSpace-completeness for \(\mathcal {P}\) [33, 38]. The high complexity of unrestricted patterns is partially due to the fact that unrestricted combinations of OPT and FILTER allow to express nesting of the difference operator DIFF with semantics [[P1DIFFP2]]G = [[P1]]G∖[[P2]]G (unless P1 or P2 are empty basic patterns, see [21] for details):
where ?x, ?y and ?z do not occur in vars(P1) ∪ vars(P2). This property is well-known [2, 21, 33], and has been usually believed to be an important source of non-well-designed patterns in practice. We challenge this belief by answering differently the question on the prevalent structure of real-life queries beyond the well-designed fragment. This question is not just of theoretical interest: as previous studies [34] show (and our analysis confirms), about half of queries with OPT asked over DBpedia are not well-designed.
Next we discuss two sources of non-well-designedness in patterns as revealed by the example queries (2) and (3) in the introduction—one based on OPT and another one on FILTER.
Source 1.
There are two substantially different ways of nesting the OPT operator in patterns:
Non-well-designed nesting of type (Opt-R) is responsible for the PSpace-hardness of query evaluation [33, 38]. Moreover, such nesting is not very intuitive unless well-designed. On the contrary, as we saw in the introduction, non-well-designed nesting of type (Opt-L) can be used for prioritising some parts of patterns to others, and is indeed used in real life. As we will see later, nesting of type (Opt-L) cannot lead to high complexity of evaluation.
Source 2.
Well-designedness can be violated by using “dangerous” variables from the right argument of OPT in filter constraints. In particular, patterns of the form (P1OPTP2) FILTERR with R using a variable from vars(P2) ∖vars(P1) are not well-designed, but rather frequent in practice. However, such patterns almost never occur inside the right argument of other OPT-patterns. We will see that if we restrict the usage of such filters to the “top level”, we preserve the good computational properties of wd-patterns.
Motivated by these observations, we considerably generalise the notion of wd-patterns to allow for useful queries like (2) and (3) while retaining important properties of such patterns. We start with two auxiliary notions.
Definition 2
Given a pattern P, an occurrence i1 in P dominates an occurrence i2 if there exists an occurrence j of an OPT-pattern such that i1 is inside the left argument of j and i2 is inside the right argument.
Definition 3
An occurrence i of a FILTER-pattern P′FILTERR in P is top-level if there is no occurrence j of an OPT-pattern such that i is inside the right argument of j.
We are ready to give the main definition of this paper.
Definition 4
A pattern \(P \in \mathcal {P}\) is weakly well-designed (or wwd-pattern, for short) if, for each occurrence i of an OPT-subpattern P1OPTP2, the variables in vars(P2)∖vars(P1) appear outside i only in
subpatterns whose occurrences are dominated by i, and
constraints of top-level occurrences of FILTER-patterns.
We write \(\mathcal {P}_{\text {wwd}}\) for the fragment of wwd-patterns. They extend wd-patterns by allowing variables from the right argument of an OPT-subpattern that are not “guarded” by the left argument to appear in certain positions outside of the subpattern. Note that the patterns of queries (2) and (3) are wwd-patterns. Also, patterns which allow only for OPT nesting of type (Opt-L) are always weakly well-designed, same as the pattern on the right hand side of (5), which expresses DIFF. However, patterns that have subpatterns of the atter form in the right argument of OPT are not weakly well-designed. Next we give a few more examples.
Example 1
Consider the following patterns and their parse trees in Fig. 2 (we write ?x ≠ ?y for ¬(?x = ?y)):
Pattern (6) is not well-designed because of variable ?z, but is weakly well-designed since the occurrence of (?y, c, ?z) dominates (?x, d, ?z). However, the similar pattern (7) is not weakly well-designed because the occurrence of the inner OPT-pattern with the second occurrence of ?z does not dominate the first. Pattern (8) is weakly well-designed since the FILTER-pattern (which is not dominated by the inner OPT-pattern) is top-level, but pattern (9) is not, because of variable ?w in a non-top-level FILTER.
×
Proposition 1
Checking whether aUNION-freepattern P belongs to the fragment\(\mathcal {P}_{\text {wwd}}\)can be done in timeO(|P|2),where |P| is the length of the string representation of P.
Proof
First note that a UNION-free pattern P is weakly well-designed if and only if so is the pattern rm_toplevel_filters(P), which is obtained from P by removing all top-level occurrences of filters. The operation rm_toplevel_filters can be implemented in linear time by the recursive procedure in Fig. 3a.
×
Next consider the recursive procedure is_wwd in Fig. 3b, where sort(S) denotes a sorted, repetition-free list representation of a set S.
Given a UNION-free pattern P without top-level filters, it is easily seen that is_wwd(P) returns a tuple of the form (true, vs, ws) if and only if P is weakly well-designed, where ws is the sorted list of “unguarded” variables in P, that is, variables occurring in the second argument of an OPT-subpattern P′ of P but not in the first argument of P′, and vs = sort(vars(P))∖ws. Procedure is_wwd can be implemented in quadratic time since sort (which may take time O(n log n)) is only applied to atomic subexpressions and set operations on sorted lists take linear time. □
4 OPT-FILTER-Normal Form and Constraint Pattern Trees
One of the key properties of wd-patterns is that they can always be converted to a so-called OPT-normal form, in which all AND- and FILTER-subpatterns are OPT-free [33]. Also, FILTER-free patterns in OPT-normal form can be naturally represented as trees of a special form [27, 35], which give a good intuition for the evaluation and optimisation of such patterns. In this section, we show that these notions can be generalised to wwd-patterns.
Definition 5
A pattern \(P\in \mathcal {P}\) is in OPT- FILTER-normal form (or OF-normal form for short) if it adheres to the grammar
$$\begin{array}{lll} P &::= & F ~\mid~ (P~ {{\mathsf{FILTER}}} ~R) ~\mid~ (P~{{\mathsf{OPT}}}~S),\\ S & ::= & F ~\mid~ (S~{{\mathsf{OPT}}}~S), \\ F & ::= & (B~ {{\mathsf{FILTER}}} ~R), \end{array} $$
where B ranges over basic patterns and R over filter constraints.
In other words, the parse tree of a pattern in OF-normal form can be stratified as follows:
1.
(occurrences of) basic patterns as the bottom layer,
2.
a FILTER on top of each basic pattern as the middle layer,
3.
a combination of OPT and FILTER as the top layer;
moreover, each occurrence of a FILTER-pattern in the top layer is top-level (according to Definition 3). Note that our normal form is AND-free: all conjunctions are expressed via basic patterns.
Example 2
None of the four patterns in Example 1 are in OF-normal form. However, the first three of them can be easily normalised by replacing each triple t with t⊤, where P⊤ is an abbreviation of PFILTER ⊤ for a pattern P. Also, compare the pattern
which is not, because the outer FILTER is in the right argument of the outermost OPT.
As shown by Letelier et al. [27], FILTER-free patterns in OPT-normal form can be represented by means of so-called pattern trees. We next show that this representation can be naturally extended to patterns in OF-normal form.
Definition 6
Let P be a pattern in OF-normal form. The constraint pattern tree (CPT)\(\mathcal {T}(P)\) of P is the directed, ordered, labelled, rooted tree recursively constructed as follows (in this definition we abuse notation and confuse patterns and their occurrences; strictly speaking, we create a fresh sub-tree for each occurrence, so the resulting object is always a tree):
1.
if B is a basic pattern then \(\mathcal {T}(B~ {{\mathsf {FILTER}}}~ R)\) is a single node v labelled by the pair (B, R);
2.
if P′ is not a basic pattern then \(\mathcal {T}(P^{\prime }~ {{\mathsf {FILTER}}} ~R)\) is obtained by adding a special node labelled by R as the last child of the root of \(\mathcal {T}(P^{\prime })\);
3.
\(\mathcal {T}(P_{1}~{{\mathsf {OPT}}}~P_{2})\) is the tree obtained from \(\mathcal {T}(P_{1})\) and \(\mathcal {T}(P_{2})\) by adding the root of \(\mathcal {T}(P_{2})\) as the last child of the root of \(\mathcal {T}(P_{1})\).
By definition, there is a one-to-one correspondence between patterns in OF-normal form and CPTs. Hence, such trees can be seen as a convenient representation of patterns in OF-normal form. Unlike parse trees, which represent the syntactic shape of patterns, CPTs show the semantic structure of OPT and FILTER nesting. Figure 4 shows how OPT nestings of types (Opt-R) and (Opt-L) are represented in both formats. Note that CPTs treat different FILTER-subpatterns differently: if the filter is over a basic pattern, the constraint of the FILTER is paired with this pattern; however, if the filter is over an OPT-subpattern, then the constraint is represented by a separate special node. Moreover, since in the second case the FILTER-pattern must be top-level, special nodes can only occur in CPTs as children of the root. For instance, the CPT of the example pattern (10) is given in Fig. 5a.
×
×
Proposition 2
Let P be a pattern inOF-normalform. Then every special node in\(\mathcal {T}(P)\)is a child of the root.
Proof
Let v be a special node in \(\mathcal {T}(P)\). Then v is obtained from a subpattern P′FILTERR where P′ is not basic. Hence, by definition of the OF-normal form, P must have the form
(for some n ≥ 0) where S1, …, Sn contain only FILTER-subpatterns over basic patterns. Thus, the root of \(\mathcal {T}(P^{\prime })\) is also the root of \(\mathcal {T}(P)\), and the claim follows. □
Next we show that each wwd-pattern can be converted to OF-normal form and hence can be represented by a CPT. To prove this statement we make use of a number of equivalences. Formally, a pattern P1 is equivalent to a pattern P2 (written P1 ≡ P2) if [[P1]]G = [[P2]]G holds for any graph G. There are several equivalences, such as associativity and commutativity of AND, as well as filter decompositions, such as PFILTER (R1 ∧ R2) ≡ (PFILTERR1) FILTERR2, which hold for all patterns (see [38] for an extensive list). Moreover, the key equivalences used in [33] for normalising wd-patterns can easily be adapted to serve our needs.
Proposition 3
LetP1, P2, P3be patterns and R a filter constraint such thatvars(P2) ∩ vars(P3) ⊆ vars(P1) andvars(P2) ∩ vars(R) ⊆ vars(P1).Then the following equivalences hold:
Both equivalences are essentially shown in [33]. While stated for well-designed patterns, the proof only exploits the properties vars(P2) ∩ vars(P3) ⊆ vars(P1) and vars(P2) ∩ vars(R) ⊆ vars(P1), which are satisfied not only by well-designed patterns, but also by weakly well-designed patterns □
Since all the equivalences preserve weak well-designedness, we obtain the desired result.
Proposition 4
Each wwd-pattern P is equivalent to a wwd-pattern inOF-normalform of sizeO(|P|).
Proof
We call a pattern P pre-normal if it adheres to the grammar that is the same as the one in Definition 5 except that the category F is extended as follows:
$$F::= B ~\mid~ (F\,{{\mathsf{FILTER}}}\,R) ~\mid~ (F\,{{\mathsf{AND}}}\,F). $$
Given a pattern P, let ||P|| be the sum of the sizes of all AND-subpatterns and all FILTER-subpatterns of P (where different occurrences of each pattern are counted separately). Consider a wwd-pattern P that is not pre-normal. Then P contains a subpattern P′ of one of the following two forms (modulo commutativity of AND): (P1OPTP2) ANDP3 and (P1OPTP2) FILTERR with P′ not top-level. In both cases we can rewrite P to a pattern S not increasing |⋅| and strictly decreasing ||⋅|| as follows.
Let P′ = (P1OPTP2) ANDP3. Since P is weakly well-designed and the occurrence of P3 is not dominated by the occurrence of P1OPTP2, we have vars(P3) ∩ vars(P2) ⊆ vars(P1). Therefore, using the first equivalence in Proposition 3, we can rewrite P to a pattern S by replacing P′ with (P1ANDP3) OPTP2. Moreover, we have |P| = |S| and ||P|| > ||S||.
Let P′ = (P1OPTP2) FILTERR where the occurrence of P′ is not top-level. Since P is weakly well-designed, we then have vars(R) ∩ vars(P2) ⊆ vars(P1), and thus, with the second equivalence in Proposition 3, we can rewrite P to a pattern S by replacing P′ with (P1FILTERR) OPTP2. Moreover, we have |P| = |S| and ||P|| > ||S||.
Since this rewriting strictly decreases ||⋅||, its repeated application to P terminates and yields a pre-normal pattern S equivalent to P with |S| = |P|.
Finally, S can be transformed to OF-normal form by replacing every occurrence of an AND- FILTER combination of basic patterns by BFILTERR where B consists of all triples in the basic patterns and R is a conjunction of all the filter conditions (if there are no filters in the combination, then R is ⊤). Clearly, this transformation is equivalence-preserving and linear in |S|. □
Relying on this proposition, in the rest of the paper we silently assume that all wwd-patterns are in OF-normal form and hence can be represented by CPTs.
We next transfer the notion of weak well-designedness to CPTs. Given a pattern P in OF-normal form, let ≺ be the strict topological sorting of the nodes in \(\mathcal {T}(P)\) computed by a depth first search traversal visiting the children of a node according to their ordering (i.e., v ≺ u holds if v is visited before u).
Lemma 1
Let P be a pattern inOF-normalform andP′ = P1OPTP2be a subpattern of P. Thenv ≺ wfor every two nodes v, w in\(\mathcal {T}(P)\)such that v is in the subtree of\(\mathcal {T}(P)\)corresponding toP1and w is in the subtree corresponding toP2.
Proof
The claim follows since \(\mathcal {T}(P^{\prime })\) is constructed by attaching \(\mathcal {T}(P_{2})\) as the last child to the root of \(\mathcal {T}(P_{1})\). □
In the following proposition, vars(u) for a node u of a CPT stands for the set of all variables in the label of u.
Proposition 5
A pattern P inOF-normalform is weakly well-designed if and only if, for each edge (v, u) with non-special u in the CPT\(\mathcal {T}(P)\),every variable ?x ∈ vars(u) ∖ vars(v) occurs only in nodes w such thatv ≺ w.The pattern is well-designed if and only if for every variable ?xin P the set of all nodes v in\(\mathcal {T}(P)\)with ?x ∈ vars(v) is connected.
Proof
For the forward direction of the first statement, suppose P is weakly well-designed. We proceed by induction on the structure of P and consider the following cases.
Let P = BFILTERR where B is basic. Then the claim is vacuous.
Let P = P1FILTERR where P1 is not basic. By the inductive hypothesis, the claim holds for \(\mathcal {T}(P_{1})\). Moreover, \(\mathcal {T}(P)\) differs from \(\mathcal {T}(P_{1})\) only in the special node labelled with R, and the claim follows by Proposition 2.
Let P = P1OPTP2. By the inductive hypothesis, the claim holds for \(\mathcal {T}(P_{1})\) and \(\mathcal {T}(P_{2})\). Thus, by Lemma 1, it suffices to show that for every edge (v, u) in \(\mathcal {T}(P_{2})\) (with non-special u by definition), no variable ?x ∈ vars(u) ∖ vars(v) occurs in \(\mathcal {T}(P_{1})\). Suppose for contradiction that this property is violated for some (v, u) and ?x. Then P2 has a subpattern \(P^{\prime }=P^{\prime }_{1}~{{\mathsf {OPT}}}~P^{\prime }_{2}\) such that \(\mathcal {T}(P^{\prime }_{1})\) is a subtree of \(\mathcal {T}(P)\) rooted at v and \(\mathcal {T}(P^{\prime }_{2})\) is the complete subtree of \(\mathcal {T}(P)\) rooted at u. Moreover, ?x occurs in P1, and thus outside P′. Since all FILTER-subpatterns in P are safe, we can assume without loss of generality that the occurrence of ?x in P1 is not in a filter constraint. However, this contradicts the assumption that P is weakly well-designed since the occurrence of ?x in P1 is not dominated by the occurrence of P′.
For the backward direction of the first claim, suppose P is not a wwd-pattern. Then P has a subpattern \(P^{\prime }=P^{\prime }_{1}~{{\mathsf {OPT}}}~ P^{\prime }_{2}\), with v the root of \(\mathcal {T}(P^{\prime })\) in \(\mathcal {T}(P)\) and u the child of v corresponding to \(\mathcal {T}(P^{\prime }_{2})\), and a variable \(?x\in {\mathsf {vars}(P^{\prime }_{2})}\setminus {\mathsf {vars}(P^{\prime }_{1})}\) such that ?x ∈ vars(u) and, for some subpattern P1OPTP2 of P, ?x occurs in P1 and P′ occurs in P2. Since \(?x\in {\mathsf {vars}(P^{\prime }_{2})}\setminus {\mathsf {vars}(P^{\prime }_{1})}\) and ?x ∈ vars(u), we have ?x ∈ vars(u) ∖vars(v). Thus, by Lemma 1, we have v ⊀ w, where w is a node in \(\mathcal {T}(P_{1})\) with an occurrence of ?x.
The second claim can be proved analogously. □
Note that if a pattern is FILTER-free, its OF-normal form coincides with the OPT-normal form in [33] (modulo tautological filters), and its CPT is the pattern tree from [27, 35]. In fact, the second part of Proposition 5 generalises an observation from [27] to the case with filters. An important difference to pattern trees is that in our case the order of children of a node is semantically relevant since wwd-patterns do not satisfy the equivalence
This equivalence, established in [32], holds whenever (vars(P2) ∩ vars(P3)) ⊆ vars(P1), which is always the case for wd-patterns but not for wwd-patterns, as can be seen on query (2).
We conclude this section with a property that is unique to wwd-patterns: each wwd-pattern is equivalent to a pattern whose corresponding CPT has depth one.
Definition 7
A pattern in \(\mathcal {P}\) is in depth-one normal form if it has the structure
We first show that any solution to the left hand side is also a solution to the right hand side. Let G be a graph and let μ ∈ [[P1OPT (P2OPTP3)]]G. We distinguish three cases.
Let μ ∈ [[P1]]G ⋈ ([[P2]]G ⋈ [[P3]]G). Then, by Lemma 2, we have [[P2]]G = [[P2]]G ⋈ [[P2]]G. Consequently, μ ∈ ([[P1]]G ⋈ [[P2]]G) ⋈ ([[P2]]G ⋈ [[P3]]G), and the claim follows.
Let μ ∈ [[P1]]G ⋈ ([[P2]]G ∖ [[P3]]G). Then μ = μ1 ∪ μ2 such that μ1 ∈ [[P1]]G, μ2 ∈ [[P2]]G, and for every μ3 ∈ [[P3]]G, μ2 ≁ μ3. Since every mapping in [[P2ANDP3]]G is an extension of some mapping in [[P3]]G, no mapping in [[P2ANDP3]]G is compatible with μ2, and hence with μ. Therefore, μ ∈ ([[P1]]G ⋈ [[P2]]G) ∖ [[P2ANDP3]]G, and the claim follows.
Let μ ∈ [[P1]]G ∖ [[P2OPTP3]]G. Then μ ∈ [[P1]]G and is incompatible with any mapping in [[P2OPTP3]]G. Moreover, since vars(P1) ∩ vars(P3) ⊆ vars(P2), μ is incompatible with any mapping in [[P2]]G, and consequently also with any mapping in [[P2ANDP3]]G. Therefore, μ ∈ ([[P1]]G ∖ [[P2]]G) ∖ [[P2ANDP3]]G, and the claim follows.
For the other direction, suppose μ ∈ [[(P1OPTP2) OPT (P2ANDP3)]]G. We distinguish three cases.
Let μ ∈ ([[P1]]G ⋈ [[P2]]G) ⋈ ([[P2]]G ⋈ [[P3]]G). Then, by Lemma 2, we have μ ∈ [[P1]]G ⋈ ([[P2]]G ⋈ [[P3]]G), and the claim follows.
Let μ ∈ ([[P1]]G ⋈ [[P2]]G) ∖ ([[P2]]G ⋈ [[P3]]G). Then μ = μ1 ∪ μ2 such that μ1 ∈ [[P1]]G, μ2 ∈ [[P2]]G, and μ is incompatible with every mapping in [[P2]]G ⋈ [[P3]]G. Since vars(P1) ∩ vars(P3) ⊆ vars(P2), this implies that [[P2]]G ⋈ [[P3]]G is empty, that is, μ2 is incompatible with every mapping in [[P3]]G. Therefore, μ2 ∈ [[P2]]G ∖ [[P3]]G, and thus μ ∈ [[P1]]G ⋈ ([[P2]]G ∖ [[P3]]G). The claim follows.
Let μ ∈ [[P1]]G ∖ [[P2]]G. Since every mapping in [[P2OPTP3]]G extends a mapping in [[P2]]G, we have that μ ∈ [[P1]]G ∖ [[P2OPTP3]]G, and the claim follows.
□
Applied from left to right, equivalence (13) preserves weak well-designedness (but not well-designedness). Each such application transforms a weakly well-designed OPT nesting of type (Opt-R) to a nesting of type (Opt-L), decreasing the depth of the CPT.
Corollary 1
Every wwd-pattern is equivalent to a wwd-pattern in depth-one normal form.
For instance, pattern (10) is equivalent to the pattern
represented by the CPT in Fig. 5b. Such “flat” patterns are attractive in practice because of their regular structure. However, “flattening” a pattern can incur an exponential blow-up in size. Hence, in the rest of the paper we consider arbitrary wwd-patterns in OF-normal form rather than restricting our attention to depth-one-normal patterns.
5 Evaluation of wwd-Patterns
In this section, we look at the query answering problem for wwd-patterns and their extensions with union and projection. We show that in all three cases, complexity remains the same as for wd-patterns. To obtain these results, we develop several new techniques.
Formally, we look at the following decision problem for a given SPARQL fragment \(\mathcal {L}\).
×
It is known that \(\textsc {Eval}({\mathcal {U}})\) for general patterns \(\mathcal {U}\) is PSpace-complete [33], and the result easily propagates to queries with projection (i.e., \(\mathcal {S}\)) [27]. For wd-patterns, the evaluation problem is coNP-complete, and can be solved by exploiting the following idea of [27].
Suppose we are given a wd-pattern P in OPT-normal form (for simplicity, assume that P is FILTER-free), a graph G, and a mapping μ. First, we look for a subtree of \(\mathcal {T}(P)\) that includes the root of \(\mathcal {T}(P)\), contains precisely the variables in domμ, and “matches” G under μ (i.e., images of all its triples under μ are contained in G). This is doable in polynomial time. If such a subtree does not exist, then μ cannot be a solution. Otherwise, the subtree witnesses that μ is a part of a solution to P. Finally, to verify that μ is a complete solution, we need to check that the subtree is maximal, that is, cannot be extended to any more nodes in \(\mathcal {T}(P)\) with a match in G. There are linearly many such nodes to check, and each check can be performed in coNP. So, the overall algorithm runs in coNP.
Inspired by this idea, we next show that the low evaluation complexity of wd-patterns transfers to wwd-patterns by developing a coNP algorithm for Eval(\({\mathcal {P}_{\text {wwd}}}\)).
Let P be a wwd-pattern in OF-normal form. An r-subtree of \(\mathcal {T}(P)\) is a subtree containing the root of \(\mathcal {T}(P)\) and all its special children. Every r-subtree \(\mathcal {T}(P^{\prime })\) of \(\mathcal {T}(P)\) is also a CPT representing a wwd-pattern P′ that can be obtained from P by dropping the right arguments of some OPT-subpatterns (a transformation known from [33]). A child of an r-subtree \(\mathcal {T}(P^{\prime })\) of \(\mathcal {T}(P)\) is a node in \(\mathcal {T}(P)\) that is not contained in \(\mathcal {T}(P^{\prime })\) but whose parent is.
Definition 8
A mapping μ is a potential partial solution (or pp-solution for short) to a wwd-pattern P over a graph G if there is an r-subtree \(\mathcal {T}(P^{\prime })\) of \(\mathcal {T}(P)\) such that dom(μ) = vars(P′), μ(triples(P′)) ⊆ G, and μ ⊧ R for the constraint R of each ordinary node in \(\mathcal {T}(P^{\prime })\).
A pp-solution μ to P over G can be witnessed by several r-subtrees. However, the union of such r-subtrees is also a witness. Hence, there exists a unique maximal witnessing r-subtree, denoted \(\mathcal {T}(P_{\mu })\), with Pμ being the corresponding wwd-pattern.
Potential partial solutions generalise “partial solutions” as defined in [33] for wd-patterns. There, every “partial solution” is either a solution or can be extended to one. This is not the case for wwd-patterns. While every solution is clearly a pp-solution, not every pp-solution can be extended to a real one. Real solutions may not just extend pp-solutions by assigning previously undefined variables but can also override variable bindings established in some node v of \(\mathcal {T}(P_{\mu })\) by extending to a child of \(\mathcal {T}(P_{\mu })\) that precedes v according to the order ≺.
An additional complication is the presence of non-well-designed top-level filters. Note that pp-solutions are only required to satisfy the constraints of ordinary nodes in the corresponding CPT, thus ignoring top-level filters. Indeed, requiring pp-solutions to satisfy constraints of top-level filters would be too strong since real solutions do not generally satisfy this property, as demonstrated by the following example.
Example 3
Consider the graph G = {(1, a, 1),(3, a, 3)} and wwd-pattern
$$P = (({(?x, a, 1)}~{{\mathsf{OPT}}}~{(?y, a, 2)})~{{\mathsf{FILTER}}}\,\neg bound(?y))~{{\mathsf{OPT}}}~{(?y, a, 3)}. $$
The mapping μ = {?x ↦ 1, ?y ↦ 3} is a solution to P over G, but μ ⊭ ¬bound(?y).
We now present a characterisation of solutions for wwd-patterns in terms of pp-solutions that (a) takes into account that not every pp-solution can be extended to a real solution and (b) ensures correct treatment of non-well-designed top-level filters. For this we need some more notation. Given a wwd-pattern P, a node v in \(\mathcal {T}(P)\), a graph G, and a pp-solution μ to P over G, let μ|v be the projection μ|X of μ to the set X of all variables appearing in nodes u of \(\mathcal {T}(P_{\mu })\) such that u ≺ v. A mapping μ1 is subsumed by a mapping μ2 (written \(\mu _{1} \sqsubseteq \mu _{2}\)) if μ1 ∼ μ2 and dom(μ1) ⊆ dom(μ2) (this notion is from [5, 33]).
Intuitively, a pp-solution μ needs to satisfy two conditions to be a real solution to a wwd-pattern P. First, μ|v (as opposed to μ for wd-patterns) must be non-extendable to v for any child v of \(\mathcal {T}(P_{\mu })\). Indeed, if such an extension exists, then it is either possible to provide bindings for some variables that are undefined in μ, or some variables from dom(μ) can be assigned different values of higher “priority” than the corresponding values in μ. Second, every top-level filter R labelling a node s needs to be satisfied by μ|s, which is precisely the part of μ bound by the subpattern of P that is paired with R in the FILTER-pattern. The following lemma formalises this intuition.
Lemma 3
A mappingμis a solution to a wwd-pattern P over a graph G if and only if
1.
μis a pp-solution to P over G;
2.
for every child v of\(\mathcal {T}(P_{\mu })\)labelled with (B, R) there is noμ′such that\(\mu |_{v} \sqsubseteq \mu ^{\prime }\),μ′ ⊧ R,andμ′(B) ⊆ G;
3.
μ|s ⊧ Rfor every special node s in\(\mathcal {T}(P)\)labelled with R.
Proof
In this proof we write \(\mathcal {T}_{v}\) for the complete subtree of a CPT \(\mathcal {T}\) rooted at a node v (i.e., the subtree over all the descendants of v including v itself) and \(\mathcal {T}_{\prec v}\) for the subtree of \(\mathcal {T}\) consisting of all nodes u such that u ≺ v.
For the forward direction, suppose μ is a solution to P over G. Clearly, μ is a pp-solution to P over G, so it suffices to show that conditions 2 and 3 hold.
For condition 2, assume for contradiction that v is a child of \(\mathcal {T}(P_{\mu })\) labelled with (B, R) and μ′ a mapping such that \(\mu |_{v}\sqsubseteq \mu ^{\prime }\), μ′ ⊧ R, and μ′(B) ⊆ G. Moreover, without loss of generality, let dom(μ′) = dom(μ) ∪ vars(B). Let u be the parent of v in \(\mathcal {T}(P)\), and let \(\mathcal {T}\) be the largest subtree of \(\mathcal {T}(P)\) that is rooted at u and has v as the last child of u. Then
. Moreover, since u is contained in \(\mathcal {T}(P_{\mu })\), there is a mapping \(\mu _{1}\sqsubseteq \mu \) such that \(\mu _{1}\in [{\kern -2.3pt}[ \mathcal {T} ]{\kern -2.3pt}]_{G}\). Since v is not contained in \(\mathcal {T}(P_{\mu })\), we have \(\mu _{1}\sqsubseteq \mu |_{v}\) and, since \(\mathcal {T}(P_{\mu })\) is the largest r-subtree witnessing μ, μ1 is not compatible with any mapping in \([{\kern -2.3pt}[ \mathcal {T}_{v} ]{\kern -2.3pt}]_{G}\). On the other hand, μ′ satisfies the label of v, and thus, since \(\mathcal {T}_{v}\) contains no top-level filters, μ′|vars(v) can be extended to a mapping of \(\mu ^{\prime \prime }\in [{\kern -2.3pt}[ \mathcal {T}_{v} ]{\kern -2.3pt}]_{G}\). Moreover, since P is weakly well-designed, \({\mathsf {vars}(\mathcal {T}_{v})}\cap {\mathsf {dom}(\mu |_{v})}\subseteq {\mathsf {vars}(v)}\), and hence dom(μ″) ∩ dom(μ1) ⊆ dom(μ′). Thus, since μ|v is compatible with μ′, μ1 is compatible with μ″, in contradiction to the above observation that μ1 is not compatible with any mapping in \([{\kern -2.3pt}[ \mathcal {T}_{v} ]{\kern -2.3pt}]_{G}\).
For condition 3, let s be a special node in \(\mathcal {T}(P)\) labelled with R. Since μ is a solution to P, there is some μ1 ⊆ μ such that \(\mu _{1}\in [{\kern -2.3pt}[ \mathcal {T}(P)_{\prec s} ]{\kern -2.3pt}]_{G}\) and μ1 ⊧ R. Hence, it suffices to show that μ1 = μ|s. Clearly, μ1 ⊆ μ|s (as μ|s is the largest mapping compatible with μ that can occur in \([{\kern -2.3pt}[ \mathcal {T}(P)_{\prec s} ]{\kern -2.3pt}]_{G}\)), so assume for contradiction that there is a variable ?x ∈ dom(μ|s) ∖ dom(μ1). Then there is a node in \(\mathcal {T}(P_{\mu |_{s}})\cap \mathcal {T}(P)_{\prec s}\) that does not occur in \(\mathcal {T}(P_{\mu _{1}})\cap \mathcal {T}(P)_{\prec s}\). This yields a contradiction with \(\mu _{1}\in [{\kern -2.3pt}[ \mathcal {T}(P)_{\prec s} ]{\kern -2.3pt}]_{G}\) analogously to the case of condition 2.
For the backward direction, suppose that μ satisfies conditions 1–3. We show that μ ∈ [[P]]G by induction on the depth of \(\mathcal {T}(P_{\mu })\), that is, the maximal number of edges between the root and a leaf.
For the basis of the induction, let the depth of \(\mathcal {T}(P_{\mu })\) be 0, that is, the root v of \(\mathcal {T}(P)\) be the only node of \(\mathcal {T}(P_{\mu })\). We prove the claim by induction on the number n of children of v in \(\mathcal {T}(P)\). If n = 0, then P = BFILTERR for some basic pattern B and filter constraint R, and the claim follows since μ is a pp-solution to P over G. For the inductive step, suppose the claim holds for all wwd-patterns P′ and mappings μ′ satisfying 1–3 provided \(\mathcal {T}(P^{\prime }_{\mu ^{\prime }})\) has depth 0 and n − 1 children in \(\mathcal {T}(P^{\prime })\). Let P and μ be such that \(\mathcal {T}(P_{\mu })\) has depth 0 and n children in \(\mathcal {T}(P)\). Let u be the last child of (the root v of) \(\mathcal {T}(P_{\mu })\). Then μ|u is a pp-solution to \(\mathcal {T}(P)_{\prec u}\) that satisfies conditions 2 and 3 since (μ|u)|w = μ|w for every w ≺ u. Hence, by the inductive hypothesis for the pattern corresponding to \(\mathcal {T}(P)_{\prec u}\) and the mapping μ|u, we have \(\mu |_{u}\in [{\kern -2.3pt}[ \mathcal {T}(P)_{\prec u} ]{\kern -2.3pt}]_{G}\). We distinguish two cases.
Let u be a special node labelled with R. Then it suffices to show that μ|u ⊧ R, which is immediate since μ|u satisfies condition 3.
Let u be an ordinary node labelled with (B, R). We know that u is not in \(\mathcal {T}(P_{\mu })\). Since v is in \(\mathcal {T}(P_{\mu })\), by condition 2 there is no mapping μ′ such that (a) \(\mu |_{u}\sqsubseteq \mu ^{\prime }\), (b) μ′ ⊧ R, and (c) μ′(B) ⊆ G. Since R is safe, it follows that every mapping satisfying (b) and (c) is incompatible with μ|u. Consequently, every mapping in \([{\kern -2.3pt}[ \mathcal {T}(P)_{u} ]{\kern -2.3pt}]_{G}\) is incompatible with μ|u, and hence \(\mu =\mu |_{u}\in [{\kern -2.3pt}[ \mathcal {T}(P)_{\prec u} ]{\kern -2.3pt}]_{G}\setminus [{\kern -2.3pt}[ \mathcal {T}(P)_{u} ]{\kern -2.3pt}]_{G}\), as required.
For the outer inductive step, let the claim hold for all P′ and μ′ with \(\mathcal {T}(P^{\prime }_{\mu ^{\prime }})\) of depth d − 1, for some d > 0. Once again, we show the claim for P and μ with \(\mathcal {T}(P_{\mu })\) of depth d by induction on the number n of children of the root v of \(\mathcal {T}(P)\). The basis is vacuous as v cannot have 0 children while \(\mathcal {T}(P_{\mu })\) has positive depth. The inductive step is the same as for depth 0, except that we have an additional case for the last child u of the root v.
Let u be an ordinary node labelled with (B′, R′) that is contained in \(\mathcal {T}(P_{\mu })\). Then μ = μ|u ∪ μ2 where μ2 is the projection of μ to the set of variables occurring in the subtree \(\mathcal {T}\) of \(\mathcal {T}(P_{\mu })\) rooted at u (i.e., \(\mathcal {T}=\mathcal {T}(P_{\mu })_{u}\)). Since u is contained in \(\mathcal {T}(P_{\mu })\) and contains no special children, μ2 is a pp-solution to (the subpattern represented by) \(\mathcal {T}(P)_{u}\). Moreover, μ2 satisfies condition 3 with respect to \(\mathcal {T}(P)_{u}\) since \(\mathcal {T}(P)_{u}\) contains no special nodes. We next show that μ2 satisfies condition 2 with respect to \(\mathcal {T}(P)_{u}\). Let w be a child of \(\mathcal {T}\) (in \(\mathcal {T}(P)_{u}\)) labelled with (B, R), and assume for contradiction that there is some μ′ such that \(\mu _{2}|_{w}\sqsubseteq \mu ^{\prime }\), μ′ ⊧ R, and μ′(B) ⊆ G. Without loss of generality, dom(μ′) = dom(μ2) ∪ vars(B). Thus, since P is weakly well-designed, vars(B) ∩ dom(μ|u) ⊆ vars(B′) ⊆ dom(μ2). Hence, μ′ is compatible with μ|u, and \(\mu |_{w}\sqsubseteq \mu |_{u}\cup \mu ^{\prime }\). Moreover, since μ′ and μ|u ∪ μ′ coincide on vars(B) and R is safe, we have that μ|u ∪ μ′ ⊧ R and (μ|u ∪ μ′)(B) ⊆ G, contradicting the assumption for μ. Since μ2 satisfies conditions 1–3 with respect to \(\mathcal {T}(P)_{u}\), by the outer inductive hypothesis we obtain that \(\mu _{2}\in [{\kern -2.3pt}[ \mathcal {T}(P)_{u} ]{\kern -2.3pt}]_{G}\), and hence \(\mu \in [{\kern -2.3pt}[ \mathcal {T}(P)_{\prec u} ]{\kern -2.3pt}]_{G}\Join [{\kern -2.3pt}[ \mathcal {T}(P)_{u} ]{\kern -2.3pt}]_{G}\) (as \(\mu |_{u}\in [{\kern -2.3pt}[ \mathcal {T}(P)_{\prec u} ]{\kern -2.3pt}]_{G}\) holds by the inner inductive hypothesis). The claim follows.
□
Checking whether a mapping μ satisfies this characterisation is feasible in coNP, and the matching lower bound follows from the coNP-hardness of evaluation of wd-patterns [33].
The lower bound of this statement is known from [33], and the upper bound can be obtained from Lemma 3 as follows.
First we show that testing whether μ is a pp-solution takes polynomial time, same as computing the maximal witnessing tree \(\mathcal {T}(P_{\mu })\). We just proceed from the root of the tree down along the branches until we cannot find a match μ(triples(v)) in G for the basic pattern in the child v which satisfies the condition in the node, and then check that the variables in the resulting tree are exactly vars(μ). So, the crucial part is to check that \(\mathcal {T}(P_{\mu })\) is not extendable to any of its children. But there are only linearly many children, and each check can be done in coNP. Finally, the checks for top-level filters are again polynomial. □
Pérez et al. [33] extended wd-patterns to UNION by considering unions of wd-patterns, that is, patterns of the form P1UNION … UNIONPn with all \(P_{i}\in \mathcal {P}_{\text {wd}}\). We denote the resulting fragment by \(\mathcal {U}_{\text {wd}}\). This syntactic restriction on the use of UNION in \(\mathcal {U}_{\text {wd}}\) is motivated by the fact that any pattern in \(\mathcal {U}\) can be equivalently expressed as a union of UNION-free patterns [33]. We denote the fragment of all queries over patterns in \(\mathcal {U}_{\text {wd}}\) by \(\mathcal {S}_{\text {wd}}\). Similarly, we write \(\mathcal {U}_{\text {wwd}}\) for unions of wwd-patterns and \(\mathcal {S}_{\text {wwd}}\) for queries over unions of wwd-patterns.
Analogously to the well-designed case, Theorem 1 extends to fragments \(\mathcal {U}_{\text {wwd}}\) and \(\mathcal {S}_{\text {wwd}}\).
The coNP-algorithm for \(\mathcal {U}_{\text {wwd}}\) is obtained simply by applying the algorithm for \(\mathcal {P}_{\text {wwd}}\) to each pattern in the union. Hardness for \(\mathcal {S}_{\text {wwd}}\) follows from the hardness of the well-designed case [27], while for membership we just guess the values of the existential variables and then call a coNP-oracle for \(\mathcal {U}_{\text {wwd}}\) on the resulting mapping and the normalised body of the query.
Hence, the complexity of evaluation for wwd-patterns is the same as for wd-patterns. We next show that wwd-patterns are, in a certain sense, a maximal extension of wd-patterns that preserves coNP evaluation complexity (under the usual complexity-theoretic assumptions).
The definition of weakly well-designed patterns suggests two intuitive ways in which it could be relaxed. Given an occurrence i of an OPT-subpattern P1OPTP2, one could allow variables in vars(P2) ∖ vars(P1) to occur in
some subpatterns whose occurrences are not dominated by i, or
constraints of some non-top-level occurrences of FILTER-patterns.
We next show that either relaxation immediately makes the evaluation problem \({{\Pi }_2^p}\)-hard.
For the first relaxation, the arguably simplest special case would be to allow for some non-well-designed OPT-nesting of type (Opt-R). Consider the fragment \(\mathcal {P}_{\text {opt-r}}\) of patterns of the form B1OPT (B2OPTB3), where B1, B2 and B3 are basic patterns. Intuitively, \(\mathcal {P}_{\text {opt-r}}\) allows for the most simple form of non-well-designed nesting of type (Opt-R).
This theorem is a corollary of [38, Theorem 4] for their class \({\mathcal {E}}_{\leq 3}\), but without UNION. □
Now suppose we allow for some non-well-designed non-top-level filters, as suggested by the second relaxation. As we will see next, even a very restricted fragment of patterns allowing for such filters is \({{\Pi }_2^p}\)-complete. This implies that the requirement that special nodes be children of the root, while it may look somewhat ad-hoc, cannot be substantially relaxed. Consider the fragment \(\mathcal {P}_{\text {filter-2}}\) of patterns of the form
where B1, B2 and B3 are basic patterns such that vars(B3) ∩ vars(B1) ⊆ vars(B2), and R is a filter constraint. Intuitively, \(\mathcal {P}_{\text {filter-2}}\) allows for the simplest form of “second-level” filters.
This problem allows for a reduction from a restriction of Eval(\({\mathcal {P}_{\text {opt-r}}}\)). Indeed, from the proof of [38, Theorem 4] it follows that it is already \({{\Pi }_2^p}\)-hard to check whether μ ∈ [[P]]G for P of the form B1OPT (B2OPTB3) with dom(μ) = vars(B1) and vars(B2) ∖ vars(B1) ≠ ∅. Let P and μ be such a pattern and such a mapping, respectively. Consider the pattern
with \(B^{\prime }_{3}\) a basic pattern obtained from B3 by replacing all the variables ?x1, …, ?xn in (vars(B3) ∩ vars(B1)) ∖vars(B2) by their fresh copies \(?x_{1}^{\prime }, \ldots , ?x_{n}^{\prime }\) (if no such variables exist, that is, if the original pattern is well-designed, we just set R to ⊤). Clearly, \(P^{\prime }\in \mathcal {P}_{\text {filter-2}}\), so it suffices to show, for every G and μ with dom(μ) = vars(B1), that μ ∈ [[P]]G if and only if μ ∈ [[P′]]G.
For the forward direction, suppose dom(μ) = vars(B1) and μ ∈ [[P]]G. Since vars(B2) ∖vars(B1) ≠ ∅, we must have μ ∈ [[B1]]G ∖ [[B2OPTB3]]G. Thus, μ ∈ [[B1]]G and for every μ′∈ [[B2OPTB3]]G we have μ ≁ μ′. Since μ ∈ [[B1]]G, to show μ ∈ [[P′]]G it suffices to verify that μ is not compatible with any \(\mu ^{\prime }\in [{\kern -2.3pt}[ {((B_{2} \cup B_{1})~{{\mathsf {OPT}}}~B^{\prime }_{3}) ~{{\mathsf {FILTER}}}~R}]{\kern -2.3pt}]_{G}\), for which we distinguish the following two cases.
For the backward direction, suppose dom(μ) = vars(B1) and \(\mu \in [{\kern -2.3pt}[ P^{\prime } ]{\kern -2.3pt}]_{G}\). Again, since vars(B2) ∖vars(B1) ≠ ∅, we have \(\mu \in [{\kern -2.3pt}[ {B_{1}} ]{\kern -2.3pt}]_{G}\setminus [{\kern -2.3pt}[ ((B_{2} \cup B_{1})~{{\mathsf {OPT}}}~B^{\prime }_{3})\)FILTERR]]G. Thus, μ ∈ [[B1]]G and μ ≁ μ′ for every \(\mu ^{\prime }\in [{\kern -2.3pt}[ ((B_{2} \cup B_{1})~{{\mathsf {OPT}}}~B^{\prime }_{3})\)FILTERR]]G. Since μ ∈ [[B1]]G, it follows that \([{\kern -2.3pt}[ {((B_{2} \cup B_{1})~{{\mathsf {OPT}}}~B^{\prime }_{3})~{{\mathsf {FILTER}}}~R} ]{\kern -2.3pt}]_{G} =\emptyset \). To show μ ∈ [[P]]G it suffices to verify that μ is not compatible with any \(\mu ^{\prime }\in [{\kern -2.3pt}[ {B_{2}\,{{\mathsf {OPT}}}\, B_{3}} ]{\kern -2.3pt}]_{G}\). Assume for the sake of contradiction that this is not the case and there is a compatible μ′. We distinguish the following two cases.
Suppose \(\mu ^{\prime }\in [{\kern -2.3pt}[ {B_{2}} ]{\kern -2.3pt}]_{G}\Join [{\kern -2.3pt}[ {B_{3}} ]{\kern -2.3pt}]_{G}\). Then there is some μ″ such that \(\mu \cup \mu ^{\prime }\cup \mu ^{\prime \prime }\in [{\kern -2.3pt}[ {B_{1}} ]{\kern -2.3pt}]_{G}\Join [{\kern -2.3pt}[ {B_{2}} ]{\kern -2.3pt}]_{G}\Join [{\kern -2.3pt}[ {B^{\prime }_{3}} ]{\kern -2.3pt}]_{G} \) and \(\mu \cup \mu ^{\prime }\cup \mu ^{\prime \prime }\models (?x^{\prime }_{1} = {} ?x_{1}) \wedge {\dots } \wedge (?x^{\prime }_{n} = {} ?x_{n})\). Thus, \(\mu \cup \mu ^{\prime }\cup \mu ^{\prime \prime }\in [{\kern -2.3pt}[ {((B_{2} \cup B_{1})~{{\mathsf {OPT}}}~ B^{\prime }_{3})~{{\mathsf {FILTER}}}~R} ]{\kern -2.3pt}]_{G} \), which is a contradiction.
Theorems 2 and 3 suggest that \(\mathcal {P}_{\text {wwd}}\) is a maximal fragment of \(\mathcal {P}\) that does not impose structural restrictions on basic patterns or filter constraints and has a coNP evaluation algorithm (assuming \(\text {\textsc {coNP}} \neq {{\Pi }_2^p}\)). Hence, going beyond wwd-patterns while preserving good computational properties requires more refined restrictions, possibly in the spirit of [27, Section 4].
6 Expressivity of wwd-Patterns and Their Extensions
In this section, we analyse the expressive power of our fragments.
Definition 9
A language \(\mathcal {L}_{1}\) is strictly less expressive than a language \(\mathcal {L}_{2}\) (written \(\mathcal {L}_{1}<\mathcal {L}_{2}\)) if for every query Q1 in \(\mathcal {L}_{1}\) there is a query Q2 in \(\mathcal {L}_{2}\) such that Q1 ≡ Q2, and there is a query Q2 in \(\mathcal {L}_{2}\) such that Q1 ≢ Q2 for every query Q1 in \(\mathcal {L}_{1}\).
We begin with UNION-free patterns, establishing that \(\mathcal {P}_{\text {wd}}<\mathcal {P}_{\text {wwd}}<\mathcal {P}\), and then proceed to unions, showing that \(\mathcal {U}_{\text {wd}}<\mathcal {U}_{\text {wwd}}<\mathcal {U}\), and queries, showing that \(\mathcal {S}_{\text {wd}}<\mathcal {S}_{\text {wwd}}<\mathcal {S}\).
Following [5, 33], a set of mappings Ω1 is subsumed by a set of mappings Ω2 (written \({\Omega }_{1} \sqsubseteq {\Omega }_{2}\)) if for every μ1 ∈ Ω1 there exists a mapping μ2 ∈ Ω2 such that \(\mu _{1} \sqsubseteq \mu _{2}\). A query Q is weakly monotone if \([{\kern -2.3pt}[ Q ]{\kern -2.3pt}]_{G_{1}} \sqsubseteq [{\kern -2.3pt}[ Q ]{\kern -2.3pt}]_{G_{2}}\) for any two graphs G1 and G2 with G1 ⊆ G2, and a fragment \(\mathcal {L}\) is weakly monotone if it contains only weakly monotone queries. Arenas and Pérez [5] showed that, unlike \(\mathcal {P}\), the fragment \(\mathcal {P}_{\text {wd}}\) is weakly monotone, and hence \(\mathcal {P}_{\text {wd}}<\mathcal {P}\).
Example 4
(Pérez et al. [33]) Consider the non-well-designed pattern
$$\begin{array}{*{20}l} P = {(?x, a, 1)}~{{\mathsf{OPT}}}~({(?y, a, 2)}~{{\mathsf{OPT}}}~{(?x, a, 3)}) \end{array} $$
as well as graphs G1 = {(1, a, 1), (2, a, 2)} and G2 = G1 ∪ {(3, a, 3)}. Then μ1 = {?x ↦ 1, ?y ↦ 2} is the only mapping in \([{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{G_{1}}\) while μ2 = {?x↦1} is the only mapping in \([{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{G_{2}}\). Hence \([{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{G_{1}}\not \sqsubseteq [{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{G_{2}}\), meaning that P is not weakly monotone.
Analogously, we show that \(\mathcal {P}_{\text {wd}}<\mathcal {P}_{\text {wwd}}\) by observing that \(\mathcal {P}_{\text {wwd}}\) is not weakly monotone.
Proposition 7
Fragment\(\mathcal {P}_{\text {wwd}}\)is not weakly monotone.
Proof
Consider a wwd-pattern
$$P=({(?x, a, 1)}~{{\mathsf{OPT}}}~{(?y, a, 2)})~{{\mathsf{OPT}}}~{(?y, a, 3)}, $$
as well as graphs G1 = {(1, a,1),(3, a,3)} and G2 = G1 ∪{(2, a,2)}. Then \([{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{{G_{1}}}=\{\{?x\mapsto 1,?y\mapsto 3\}\}\not \sqsubseteq \{\{?x\mapsto 1,?y\mapsto 2\}\}=[{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{{G_{2}}}\). □
An alternative proof of \(\mathcal {P}_{\text {wd}}<\mathcal {P}_{\text {wwd}}\) can be obtained by adapting Theorem 3.5 in [6], which exhibits a weakly well-designed, weakly monotone pattern that is not equivalent to any well-designed pattern.
To distinguish \(\mathcal {P}_{\text {wwd}}\) from \(\mathcal {P}\) we need a different property.
Definition 10
A query Q is non-reducing if for any two graphs G1, G2 such that G1 ⊆ G2 and any mapping \(\mu _{1} \in [{\kern -2.3pt}[ Q ]{\kern -2.3pt}]_{G_{1}}\) there is no \(\mu _{2} \in [{\kern -2.3pt}[ Q ]{\kern -2.3pt}]_{G_{2}}\) such that \(\mu _{2} \sqsubset \mu _{1}\) (i.e., \(\mu _{2} \sqsubseteq \mu _{1}\) and μ2 ≠ μ1). A fragment is non-reducing if it contains only non-reducing queries.
Intuitively, for a non-reducing query extending a graph cannot result in a previously bound answer variable becoming unbound. All weakly monotone queries are non-reducing but not vice versa. Moreover, all wwd-patterns are non-reducing.
Let \(P\in \mathcal {P}_{\text {wwd}}\) and let G1, G2 be two graphs such that G1 ⊆ G2. We show that \(\mu _{2}\not \sqsubset \mu _{1}\) for any \(\mu _{1}\in [{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{{G_{1}}}\) and \(\mu _{2}\in [{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{{G_{2}}}\) by induction on the structure of P, proving, in parallel, that if all filters in P are over basic patterns, then for every mapping \(\mu _{1}\in [{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{{G_{1}}}\) there is a mapping \(\mu _{2}\in [{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{{G_{2}}}\) such that μ1|vars(v) = μ2|vars(v) for v the root of \(\mathcal {T}(P)\).
For the base case, suppose P = BFILTERR for some basic pattern B and filter constraint R. Then, P is monotone in the sense of [5], that is, satisfies \([{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{{G_{1}}}\subseteq [{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{{G_{2}}}\). Moreover, P contains no OPT, and hence every two distinct mappings in \([{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{G_{2}}\) have the same domain and are thus incompatible. These facts imply both claims.
For the inductive step, suppose first that P = P1OPTP2 and both claims hold for P1 and P2. Let \(\mu _{1}\in [{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{G_{1}}\). We first prove that \(\mu _{2}\not \sqsubset \mu _{1}\) for any \(\mu _{2}\in [{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{{G_{2}}}\). We distinguish two cases.
Let \(\mu _{1}={\mu _{1}^{1}}{\cup \mu _{1}^{2}}\) where \({\mu _{1}^{i}}\in [{\kern -2.3pt}[ P_{i} ]{\kern -2.3pt}]_{G_{1}}\). Assume for contradiction that \(\mu _{2}\sqsubset \mu _{1}\) for some \(\mu _{2}\in [{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{{G_{2}}}\). We begin by showing that μ2 must be of the form \({\mu _{2}^{1}}{\cup \mu _{2}^{2}}\) where \({\mu _{2}^{i}}\in [{\kern -2.3pt}[ P_{i} ]{\kern -2.3pt}]_{G_{2}}\), for which it suffices to show that \(\mu _{2}|_{{\mathsf {vars}(P_{1})}}\) is compatible with some mapping in \([{\kern -2.3pt}[ P_{2} ]{\kern -2.3pt}]_{G_{2}}\). On the one hand, since \(\mu _{2}\sqsubset \mu _{1}\), \({\mu _{1}^{2}}\) is compatible with \(\mu _{2}|_{{\mathsf {vars}(P_{1})}}\). On the other hand, since all filters in P2 are over basic patterns, the inductive hypothesis tells us that \([{\kern -2.3pt}[ P_{2} ]{\kern -2.3pt}]_{G_{2}}\) contains a mapping μ′ that coincides with \({\mu _{1}^{2}}\) on the set of variables X in the root of \(\mathcal {T}(P_{2})\); moreover, since P is weakly well-designed, dom(μ′) ∩ vars(P1) ⊆ X, and hence μ′ is compatible with \(\mu _{2}|_{{\mathsf {vars}(P_{1})}}\). Thus, \(\mu _{2}={\mu _{2}^{1}}{\cup \mu _{2}^{2}}\) where \({\mu _{2}^{i}}\in [{\kern -2.3pt}[ P_{i} ]{\kern -2.3pt}]_{G_{2}}\). Then, however, we must have that \({\mu _{2}^{1}}{\sqsubset \mu _{1}^{1}}\) or \({\mu _{2}^{2}}{\sqsubset \mu _{1}^{2}}\), contradicting the inductive hypothesis for P1 or P2, respectively.
Let \(\mu _{1}\in [{\kern -2.3pt}[ P_{1} ]{\kern -2.3pt}]_{G_{1}}\setminus [{\kern -2.3pt}[ P_{2} ]{\kern -2.3pt}]_{G_{1}}\), and let μ2 be an arbitrary mapping in \([{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{{G_{2}}}\). Then μ2 extends some \(\mu ^{\prime }\in [{\kern -2.3pt}[ P_{1} ]{\kern -2.3pt}]_{G_{2}}\). By the inductive hypothesis for claim 2, we have that \(\mu ^{\prime }\not \sqsubset \mu _{1}\), and hence \(\mu _{2}\not \sqsubset \mu _{1}\).
Suppose now that all filters in P are over basic patterns. We need to prove that there is \(\mu _{2}\in [{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{{G_{2}}}\) such that μ1|vars(v) = μ2|vars(v). We know that μ1 extends some \(\mu ^{\prime }_{1}\in [{\kern -2.3pt}[ P_{1} ]{\kern -2.3pt}]_{G_{1}}\). Thus, by the inductive hypothesis, there is some \(\mu ^{\prime }_{2}\in [{\kern -2.3pt}[ P_{1} ]{\kern -2.3pt}]_{G_{2}}\) that coincides with \(\mu ^{\prime }_{1}\) on the variables in the root of \(\mathcal {T}(P_{1})\). The claim follows since \(\mu ^{\prime }_{2}\) can be extended to a mapping μ2 for P that coincides with \(\mu ^{\prime }_{2}\) on the variables in the root of \(\mathcal {T}(P_{1})\), and, by construction, the root of \(\mathcal {T}(P_{1})\) and the root of \(\mathcal {T}(P)\) have the same label.
Consider now the inductive step for the case when P = P1FILTERR. Since P1 is not a basic pattern, we only need to show that \(\mu _{2}\not \sqsubset \mu _{1}\) for any \(\mu _{1}\in [{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{{G_{1}}}\) and \(\mu _{2}\in [{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{{G_{2}}}\). This holds by the inductive hypothesis, because \(\mu _{1}\in [{\kern -2.3pt}[ P_{1} ]{\kern -2.3pt}]_{G_{1}}\) and \(\mu _{2}\in [{\kern -2.3pt}[ P_{1} ]{\kern -2.3pt}]_{G_{2}}\) for any such μ1 and μ2. □
In contrast to Proposition 8, patterns in \(\mathcal {P}\) do not generally satisfy non-reducibility. For instance, consider again pattern P, graphs G1, G2, and mappings μ1, μ2 from Example 4. Pattern P is not non-reducing since \(\mu _{1}\in [{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{G_{1}}\) and \(\mu _{2}\in [{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{G_{2}}\) but \(\mu _{2}\sqsubset \mu _{1}\). Therefore, we have the following theorem.
Theorem 4
It holds that\(\mathcal {P}_{\text {wd}}<\mathcal {P}_{\text {wwd}}<\mathcal {P}\).
We next compare \(\mathcal {U}_{\text {wwd}}\) to \(\mathcal {U}_{\text {wd}}\) and \(\mathcal {U}\), as well as \(\mathcal {S}_{\text {wwd}}\) to \(\mathcal {S}_{\text {wd}}\) and \(\mathcal {S}\) (note that neither UNION nor projection via SELECTcan be expressed by means of the other operators [40], so adding either construct makes each fragment strictly more expressive). It is easily seen that \(\mathcal {U}_{\text {wd}}\) and \(\mathcal {S}_{\text {wd}}\) inherit weak monotonicity from \(\mathcal {P}_{\text {wd}}\) [27, 33], and hence \(\mathcal {U}_{\text {wd}}<\mathcal {U}_{\text {wwd}}\) and \(\mathcal {S}_{\text {wd}}<\mathcal {S}_{\text {wwd}}\). Non-reducibility, however, does not propagate to unions.
Example 5
Consider the following \(\mathcal {U}_{\text {wd}}\)-pattern with G1, G2 and μ1, μ2 from Example 4:
$$\begin{array}{*{20}l} P = ({(?x, a, 1)}~{{\mathsf{OPT}}}~{(?y, a, 2)})~{{\mathsf{UNION}}}~{(?x, a, 1)}. \end{array} $$
We have \(\mu _{1}\in [{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{G_{1}}\) and \(\mu _{2}\in [{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{G_{2}}\) but \(\mu _{2}\sqsubset \mu _{1}\), which is due to the fact that μ2 is already contained in \([{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{G_{1}}\) along with μ1. This is only possible in the presence of UNION since all mappings in the evaluation of a UNION-free pattern are mutually non-subsuming (see Lemma 2).
Thus, to account for UNION, we introduce the following, more delicate property.
Definition 11
A query Q is extension-witnessing (e-witnessing) if for any two graphs G1 ⊆ G2 and mapping \(\mu \!\in \![{\kern -2.3pt}[ Q ]{\kern -2.3pt}]_{G_{2}}\) such that \(\mu \!\notin \![{\kern -2.3pt}[ Q ]{\kern -2.3pt}]_{G_{1}}\!\) there is a triple t in Q such that vars(t) ⊆ dom(μ) and μ(t) ∈ G2 ∖ G1. A fragment is e-witnessing if so are all of its queries.
Informally, a query Q is e-witnessing if whenever an extension of a graph leads to a new answer, this answer is justified by a triple pattern in Q which maps to the extension. Unions of wwd-patterns can be shown e-witnessing.
Let \(P\in \mathcal {U}_{\text {wwd}}\) and let G1, G2 be graphs such that G1 ⊆ G2. Let μ be a mapping in \([{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{G_{2}}\) but not in \([{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{{G_{1}}}\). We show that there is some t ∈ triples(P) such that μ(t) ∈ G2 ∖ G1.
Since P is a union of wwd-patterns, there is some wwd-pattern P′ in the union such that \(\mu \in [{\kern -2.3pt}[ P^{\prime } ]{\kern -2.3pt}]_{G_{2}}\). It suffices to show \(\mu (\mathsf {triples}(P^{\prime }_{\mu }))\cap (G_{2}\setminus G_{1})\ne \emptyset \), where \(P^{\prime }_{\mu }\) is the pattern corresponding to the maximal r-subtree of P witnessing μ in G2 (i.e., the part of P in the image of μ, see Definition 8). We know that \(\mu (\mathsf {triples}(P^{\prime }_{\mu })) \subseteq G_{2}\). Assume, for contradiction, that \(\mu (\mathsf {triples}(P^{\prime }_{\mu }))\subseteq G_{1}\). Then μ is a pp-solution to P′ over G1. We next show that μ is a real solution to P′ over G1. By Lemma 3, it suffices to show that (a) for any child u of \(\mathcal {T}(P^{\prime }_{\mu })\) labelled with (B, R), there is no mapping μ′ such that \(\mu |_{u}\sqsubseteq \mu ^{\prime }\), μ′ ⊧ R, and μ′(B) ⊆ G1, and (b) μ|s ⊧ R for any special node s in \(\mathcal {T}(P^{\prime })\) labelled with R. Claim (a) holds since \(\mu \in [{\kern -2.3pt}[ P^{\prime } ]{\kern -2.3pt}]_{G_{2}}\) and G1 ⊆ G2 while (b) holds since \(\mu \in [{\kern -2.3pt}[ P^{\prime } ]{\kern -2.3pt}]_{G_{2}}\) and the claim does not depend on the graph over which the evaluation is computed. Consequently, \(\mu \in [{\kern -2.3pt}[ P^{\prime } ]{\kern -2.3pt}]_{G_{1}}\), and hence \(\mu \in [{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{{G_{1}}}\), in contradiction to the assumption. □
On the other hand, \(\mathcal {U}\) is not e-witnessing, as can be seen on the pattern and graphs in Example 4. Hence, we obtain the following theorem.
Theorem 5
It holds that\(\mathcal {U}_{\text {wd}}<\mathcal {U}_{\text {wwd}}<\mathcal {U}\).
Next we move to the fragments that allow for projection. As already mentioned, we have \(\mathcal {S}_{\text {wd}}<\mathcal {S}_{\text {wwd}}\) since \(\mathcal {S}_{\text {wd}}\) is weakly monotone while \(\mathcal {S}_{\text {wwd}}\) is not. However, \(\mathcal {S}_{\text {wwd}}\) is not e-witnessing, so we cannot apply the technique of Theorem 5 to establish \(\mathcal {S}_{\text {wwd}}<\mathcal {S}\); instead, we make use of the following lemma.
Lemma 4
Let Q be a query in\(\mathcal {S}_{\text {wwd}}\)and G be a graph. For every graphG1withG ⊆ G1and every\(\mu \in [{\kern -2.3pt}[ Q ]{\kern -2.3pt}]_{{G_{1}}}\),there is a graphG2withG ⊆ G2such that\(\mu \in [{\kern -2.3pt}[ Q ]{\kern -2.3pt}]_{{G_{2}}}\)and |G2| ≤ |G| + |triples(Q)|.
Proof
Let Q = SELECTXWHEREP, for P a union of wwd-patterns, and let G, G1 and μ be as required. Then there is a wwd-pattern P′ in the union P such that \(\mu ^{\prime }\in [{\kern -2.3pt}[ P^{\prime } ]{\kern -2.3pt}]_{G_{1}}\) for some μ′ with μ′|X = μ. Let \(G_{2}=G\cup \mu ^{\prime }(\mathsf {triples}(P^{\prime }_{\mu ^{\prime }}))\). Clearly, |G2| ≤ |G| + |triples(Q)|, so it suffices to show that \(\mu ^{\prime }\in [{\kern -2.3pt}[ P^{\prime } ]{\kern -2.3pt}]_{G_{2}}\).
By construction, μ′ is a pp-solution to P′ over G2. Moreover, since μ′ is a solution to P′ over G1, we have that μ|s ⊧ R for every special node s in \(\mathcal {T}(P^{\prime })\) labelled with R. Finally, suppose for contradiction that there is a child v of \(\mathcal {T}(P^{\prime }_{\mu ^{\prime }})\) labelled with (B, R) and a mapping μ″ such that \(\mu ^{\prime }|_{v}\sqsubseteq \mu ^{\prime \prime }\), μ″ ⊧ R, and μ″(B) ⊆ G2. However, since G2 ⊆ G1, we then have μ″(B) ⊆ G1, which contradicts the fact that \(\mu ^{\prime }\in [{\kern -2.3pt}[ P^{\prime } ]{\kern -2.3pt}]_{G_{1}}\). □
This lemma is the base of the last result of the section.
Theorem 6
It holds that\(\mathcal {S}_{\text {wd}}<\mathcal {S}_{\text {wwd}}<\mathcal {S}\).
Proof
As observed before, the inclusion \(\mathcal {S}_{\text {wd}}<\mathcal {S}_{\text {wwd}}\) holds since \(\mathcal {S}_{\text {wd}}\) is weakly monotone [27, 33] and \(\mathcal {S}_{\text {wwd}}\) is not.
As for the second inclusion, consider the family of graphs
By equivalence (5) in Section 3, the operator DIFF can be expressed via OPT, AND and FILTER, so we can assume that \(Q \in \mathcal {S}\). On the other hand, it is easily seen that \(Q\notin \mathcal {S}_{\text {wwd}}\). The mapping μ = {?x ↦ a} is an answer to Q over G1 but not an answer over any Gn with n ≥ 2. Moreover, it is easily seen that any extension G of Gn such that μ ∈ [[Q]]G requires the addition of at least n − 1 triples, namely {(d2, c, c), …, (dn, c, c)}. Consequently, μ ∈ [[Q]]G implies |G| ≥ |G1| + n − 1.
Suppose for contradiction there is a query Q′ in \(\mathcal {S}_{\text {wwd}}\) such that Q′ ≡ Q. Let n = |triples(Q′)| + 2. Then, by Lemma 4, μ ∈ [[Q′]]G for some G with |G| ≤ |Gn| + |triples(Q′)| = |Gn| + n − 2, which contradicts the above observation for Q. □
7 Static Analysis of wwd-Patterns
In this section, we look at the general static analysis problems of query equivalence and containment. Formally, equivalence for a language \(\mathcal {L}\) is defined as follows.
×
This problem is commonly generalised to \(\textsc {Containment}({\mathcal {L}})\), in which one checks whether Q is contained in Q′, written Q ⊆ Q′, that is, whether [[Q]]G ⊆ [[Q′]]G holds for every graph G. We have Q ≡ Q′ if and only if Q and Q′ contain each other.
These problems have been studied for FILTER-free wd-patterns in [27, 35], establishing NP-completeness of equivalence and containment. Moreover, both problems are \({{\Pi }_{2}^{p}}\)-complete for unions of FILTER-free wd-patterns, and undecidable for fragments with projection. Finally, from the results in [41] it follows that containment is undecidable for \(\mathcal {U}\). On the other hand, nothing seems to be known so far for well-designed patterns with FILTER.
We next show that equivalence and containment are both \({{\Pi }_2^p}\)-complete for \(\mathcal {P}_{\text {wwd}}\) and \(\mathcal {U}_{\text {wwd}}\) (whereas they are undecidable for \(\mathcal {S}_{\text {wwd}}\) by the results in [35]). As the following lemma shows, the upper bound for containment follows from a small counterexample property: if P ⊈ P′ for some P and P′ from \(\mathcal {U}_{\text {wwd}}\), then there is a witnessing mapping and graph of size O(|P| + |P′|). Given this property, a \({{\Pi }_2^p}\) algorithm for containment is straightforward—we guess a mapping μ and a graph G of linear size, check that μ ∉ [[P′]]G, and then call a coNP oracle for checking μ ∈ [[P]]G. As a corollary, Equivalence(\({\mathcal {U}_{\text {wwd}}}\)) is also in \({{\Pi }_{2}^{p}}\).
Lemma 5
Let P andP′be two patterns from\(\mathcal {U}_{\text {wwd}}\).IfP ⊈ P′then there exists a mappingμand a graph G of sizeO(|triples(P)| + |triples(P′)|) such thatμ ∈ [[P]]Gbutμ ∉ [[P′]]G.
where all Pi and \(P^{\prime }_{j}\) are wwd-patterns in OF-normal form.
Since P ⊈ P′, there exists a graph G′, a mapping μ, and a pattern Pi, 1 ≤ i ≤ n, such that \(\mu \in [{\kern -2.3pt}[ P_{i} ]{\kern -2.3pt}]_{G^{\prime }}\), but \(\mu \notin [{\kern -2.3pt}[ P^{\prime }_{j} ]{\kern -2.3pt}]_{G^{\prime }}\) for every \(P^{\prime }_{j}\). Hence, μ is a pp-solution to Pi over G′ with corresponding r-subtree \(\mathcal {T}((P_{i})_{\mu })\) of the CPT \(\mathcal {T}(P_{i})\). Let G0 = μ(triples((Pi)μ)). By construction, we have that G0 ⊆ G′ and |G0| ≤ |triples((Pi)μ)| ≤ |triples(Pi)| ≤ |triples(P)|. Moreover, \(\mu \in [{\kern -2.3pt}[ P_{i} ]{\kern -2.3pt}]_{G_{0}}\), because all the matches and constraints, including the ones on the top-level, stay unchanged. In fact, \(\mu \in [{\kern -2.3pt}[ P_{i} ]{\kern -2.3pt}]_{G^{\prime \prime }}\) for any G″ such that G0 ⊆ G″⊆ G′.
If \(\mu \notin [{\kern -2.3pt}[ P^{\prime }_{j} ]{\kern -2.3pt}]_{G_{0}}\) for every j, then G0 satisfies all the properties required from G. Otherwise, there exists \(P^{\prime }_{j}\) among \(P^{\prime }_{1}, \ldots , P^{\prime }_{m}\) such that \(\mu \in [{\kern -2.3pt}[ P^{\prime }_{j} ]{\kern -2.3pt}]_{G_{0}}\). Since G0 ⊆ G′, μ is a pp-solution to \(P^{\prime }_{j}\) over G′. Consider the corresponding pattern \((P^{\prime }_{j})_{\mu }\) (i.e., the maximal pattern witnessing μ in G′ obtained from \(P^{\prime }_{j}\) by dropping the right arguments of some OPT operators), the r-subtree \(\mathcal {T}((P^{\prime }_{j})_{\mu })\) of the CPT \(\mathcal {T}(P^{\prime }_{j})\), and the “image” \(\mu (\mathsf {triples}((P^{\prime }_{j})_{\mu }))\). Note that we may have \(\mu (\mathsf {triples}((P^{\prime }_{j})_{\mu })) \subseteq G_{0}\) or not: the latter is possible because the maximal r-subtree of \(\mathcal {T}(P^{\prime }_{j})\) witnessing μ in G0 may be different from \(\mathcal {T}((P^{\prime }_{j})_{\mu })\), which is maximal in G′. Let \(G^{\prime }_{1} = G_{0} \cup \mu (\mathsf {triples}((P^{\prime }_{j})_{\mu }))\). We define G1 depending on whether \(\mu \in [{\kern -2.3pt}[ P^{\prime }_{j} ]{\kern -2.3pt}]_{G^{\prime }_{1}}\) or not. If \(\mu \notin [{\kern -2.3pt}[ P^{\prime }_{j} ]{\kern -2.3pt}]_{G^{\prime }_{1}}\), then let \(G_{1} = G^{\prime }_{1}\). Otherwise, since \(\mu \notin [{\kern -2.3pt}[ P^{\prime }_{j} ]{\kern -2.3pt}]_{G^{\prime }}\) by assumption, there exists a child v of \(\mathcal {T}((P^{\prime }_{j})_{\mu })\) and a mapping μ0 such that \(\mu |_{v}\sqsubset \mu _{0}\) and μ0(triples(v)) ⊆ G′. Then the graph \(G_{1} = G^{\prime }_{1} \cup \mu _{0}(\mathsf {triples}(v))\) is such that \(\mu \notin [{\kern -2.3pt}[ P^{\prime }_{j} ]{\kern -2.3pt}]_{G_{1}}\). In either case, \(\mu \in [{\kern -2.3pt}[ P_{i} ]{\kern -2.3pt}]_{G_{1}}\) because G0 ⊆ G1 ⊆ G′. Moreover, we have \(\mu \notin [{\kern -2.3pt}[ P^{\prime }_{j} ]{\kern -2.3pt}]_{G^{\prime \prime }}\) for every G″ such that G1 ⊆ G″ ⊆ G′. To see this, suppose for contradiction that \(\mu \in [{\kern -2.3pt}[ P^{\prime }_{j} ]{\kern -2.3pt}]_{G^{\prime \prime }}\) for a graph G″ as above. Then there must be a child v′ of \(\mathcal {T}((P^{\prime }_{j})_{\mu _{0}})\) such that v′≺ v, \(\mu _{o}|_{v^{\prime }}\sqsubset \mu \) and μ(triples(v′)) ⊆ G″. Since \(\mathcal {T}((P^{\prime }_{j})_{\mu _{0}})\) and \(\mathcal {T}((P^{\prime }_{j})_{\mu })\) are identical restricted to nodes preceding v with respect to ≺, v′ is a child of \(\mathcal {T}((P^{\prime }_{j})_{\mu })\). Thus, v′ is not contained in \(\mathcal {T}((P^{\prime }_{j})_{\mu })\), which contradicts maximality of \(\mathcal {T}((P^{\prime }_{j})_{\mu })\) since μ(triples(v′)) ⊆ G″⊆ G′.
If \(\mu \notin [{\kern -2.3pt}[ P^{\prime }_{j} ]{\kern -2.3pt}]_{G_{1}}\) for all other j as well, then G1 satisfies all the properties required from G. Otherwise we can extend G1 to a graph G2 on the base of some other \(P^{\prime }_{j}\) with \(\mu \in [{\kern -2.3pt}[ P^{\prime }_{j} ]{\kern -2.3pt}]_{G_{1}}\) in the same way as G1 extends G0. We then have G2 ⊆ G′, \(\mu \in [{\kern -2.3pt}[ P ]{\kern -2.3pt}]_{G_{2}}\), and \(\mu \notin [{\kern -2.3pt}[ P^{\prime }_{j} ]{\kern -2.3pt}]_{G_{2}}\) for j from both steps. Repeating the extension step until there are no \(P^{\prime }_{j}\) having μ as a solution on the resulting graph, we obtain a graph that satisfies all the properties required from G; in particular, for each j the number of added triples to the graph is bounded by \(|\mathsf {triples}(P^{\prime }_{j})|\). □
Hardness of equivalence is established in the following lemma by a reduction of ∀∃3SAT, while containment is \({{\Pi }_2^p}\)-hard by the results in [35]. Note that both results hold even for fragments without FILTER.
Lemma 6
ProblemEquivalence(\({\mathcal {O}_{\text {wwd}}}\))is\({{\Pi }_{2}^{p}}\)-hardfor the fragment\(\mathcal {O}_{\text {wwd}}\)ofFILTER-freewwd-patterns.
Proof
We proceed by reduction of the ∀∃3SAT problem, that is, the problem of checking whether a formula of the form
$$ \forall \bar{x} \, \exists \bar y \, \psi, $$
(14)
holds for a conjunction ψ of clauses t1 ∨ t2 ∨ t3 with ti propositional literals, that is, propositional variables from \(\bar {x} \cup \bar y\) or their negations. Without loss of generality, we assume that ψ contains no tautologous clauses and no clauses with duplicate literals. Let ϕ be a formula of the form (14). Starting from ϕ, we construct FILTER-free wwd-patterns P and P′ in OF-normal form, and then show that ϕ is true if and only if P ≡ P′. Let \(\bar {x} = x_{1}, \ldots , x_{n}\) and \(\bar y = y_{1}, \ldots , y_{m}\).
For each clause γ = t1 ∨ t2 ∨ t3, there are exactly 7 assignments to the variables in t1, t2, t3 making γ true, and exactly one assignment making γ false (since γ is assumed to be non-tautologous and contain no duplicate literals). Let, for each such γ in ψ, each ℓ, 1 ≤ ℓ ≤ 7, and each j, 1 ≤ j ≤ 3, val(γ, j, ℓ) = ⊤ if the variable of literal tj evaluates to true in the ℓ’th assignment making γ true, and val(γ, j, ℓ) = ⊥, otherwise; here ⊤ and ⊥ are fresh IRIs. Let also, for every clause γ in ψ, \(\mathit {cl}_{\gamma }^{1} ,\dots ,\mathit {cl}_{\gamma }^{7} \) and \(\mathit {lit}_{\gamma }^{1} \), \(\mathit {lit}_{\gamma }^{2} \), \(\mathit {lit}_{\gamma }^{3} \) be fresh IRIs. We define, for each γ and 1 ≤ ℓ ≤ 7, a basic pattern
(note that these patterns do not have any variables).
Let, for each propositional variable \(z \in \bar {x} \cup \bar y\), iriz be a fresh IRI and ?z be a fresh SPARQL variable. For each γ, let also ?cγ, \(?v_{\gamma }^{1} \), \(?v_{\gamma }^{2} \), \(?v_{\gamma }^{3} \) be fresh variables. Let
where each \(\mathit {var}_{\gamma }^{j} \) and \(\mathit {iri}_{\gamma }^{j} \), 1 ≤ j ≤ 3, are the variable and the IRI corresponding to the variable of literal tj in γ; that is, \(\mathit {var}_{\gamma }^{j} ={?z}\) and \(\mathit {iri}_{\gamma }^{j} = iri_{z}\) if tj = z or tj = ¬z.
Let ?u and ?s be fresh variables and r, cy⊥, cy⊤ fresh IRIs. We define
be two FILTER-free wwd-patterns in OF-normal form.
We next show that ϕ is true if and only if P is equivalent to P′, starting with the forward direction.
Let ϕ be true, yet, for the sake of contradiction, P is not equivalent to P′. Then there is a graph G and mapping μ such that μ ∈ [[P]]G, but μ ∉ [[P′]]G. Since patterns P and P′ have the same root Bbase, which contains ?u as the only variable, we conclude that ?u ∈ dom(μ). Each ?xi is also in dom(μ) by the construction of P, since there is a homomorphism from the corresponding leaf \(B_{i}^{\top }\) to the root Bbase. However, it is not necessary that \({(\mu (?x_{i}), iri_{x_{i}}, \top )}\) is in G because if G contains a triple of the form \({(c, iri_{x_{i}}, \bot )}\) for some IRI c, we will have \({(\mu (?x_{i}), iri_{x_{i}}, \bot )}\in G\). Note also that nothing prevents G from containing both a triple \({(c, iri_{x_{i}}, \bot )}\) and a triple \({(c, iri_{x_{i}}, \top )}\) for some i. Depending on whether ?s ∈ dom(μ) or not, we have two cases.
Case 1
Let ?s ∈ dom(μ), that is, there is a homomorphism from Bψ to G that aligns with the previous assignment of all ?xi. In particular, this means that dom(μ) = vars(P) = vars(P′). If there is no homomorphism from Bvalid to G, then μ ∈ [[P′]]G, because Bψ is the last leaf of P′ as well, and nothing prevents it from matching. But this contradicts the assumption. However, even if there is a homomorphism h from Bvalid to G, we still have a contradiction because \(\mu \in [{\kern -2.3pt}[ P^{\prime } ]{\kern -2.3pt}]_{G}\) still holds. Indeed, ?s is the only variable in Bvalid and is essentially isolated in Bvalid, so if h is a homomorphism from Bvalid to G, then h′, which maps ?s to μ(?s), is also such a homomorphism (in other words, since by the assumptions of this case, we know that (?s, r, ?s) has a match in G, the existence of h just means that all the ground triples of Bvalid are in G). This means, however, that nothing prevents Bψ from matching in P′, implying μ ∈ [[P′]]G.
Case 2
Let ?s ∉ dom(μ). Since μ ∉ [[P′]]G, there is no homomorphism from Bψ to G but there is one from Bvalid to G, that is, all ground triples of Bvalid are in G (the non-existence of a homomorphism from Bψ to G is immediate since ?s ∉ dom(μ) and μ ∈ [[P]]G; the existence of a homomorphism from Bvalid to G then follows since otherwise we would have \(\mu \in [{\kern -2.3pt}[ P^{\prime } ]{\kern -2.3pt}]_{G}\)). Consider now a truth assignment α of variables \(\bar {x}\) such that if α(xi) is true then \({(\mu (?x_{i}), iri_{x_{i}}, \top )} \in G\) and if α(xi) is false then \({(\mu (?x_{i}), iri_{x_{i}}, \bot )} \in G\) (as we mentioned earlier, α may be not unique, but the argument does not depend on its uniqueness). Since ϕ is true, we know that α can be extended to the variables \(\bar y\) in such a way that each clause in ψ holds. Let α′ be such an extension, and let μ′ be an extension of μ to the variables in Bψ such that, for all j, μ′(?yj) = cy⊤ if α′(yj) is true and μ′(?yj) = cy⊥ otherwise. Then, for every clause γ in ψ, the IRIs \(\mu ^{\prime }(?v_{\gamma }^{1} )\), \(\mu ^{\prime }(?v_{\gamma }^{2})\), \(\mu ^{\prime }(?v_{\gamma }^{3} )\) correspond to the values α′(z1), α′(z2), α′(z3), respectively, where z1, z2, z3 are the variables in the literals t1, t2, t3 of γ. Moreover, \(\mu ^{\prime }(?c_{\gamma }) = \mathit {cl}_{\gamma }^{\ell } \) for ℓ the number of the assignment α′(z1), α′(z2), α′(z3); this assignment makes γ true by the choice of α′ (in other words, for every γ there is some ℓ such that \({(\mu ^{\prime }(?c_{\gamma }), \mathit {lit}_{\gamma }^{i}, \mu ^{\prime }(?v_{\gamma }^{i} ))}\in B_{\gamma }^{\ell }\) for all 1 ≤ i ≤ 3). Hence, the extension μ′ is contained in [[P]]G, and hence μ ∉ [[P]]G, which, however, contradicts the original assumption.
Since both cases yield a contradiction, we conclude that P is equivalent to P′.
We continue with the backward direction of the equivalence. Suppose that P ≡ P′, yet, for the sake of contradiction, ϕ is false. Then, there is a truth assignment α of the variables \(\bar {x}\) such that for each extension α′ of α to the variables \(\bar y\) there is a clause γ in ψ that evaluates to false under α′. Fix such an α and consider the graph G consisting of
the triple \({(u, iri_{x_{i}}, \top )}\), for each xi in \(\bar {x}\) and a fresh IRI u;
the triple \({(c_{x_{i}}, iri_{x_{i}}, \top )}\), for each xi in \(\bar {x}\) with α(xi) true, and the triple \({({c_{x_{i}}}, {iri_{x_{i}}}, {\bot })}\), for each xi in \(\bar {x}\) with α(xi) false, where \(c_{x_{1}},\dots ,c_{x_{n}}\) are fresh IRIs;
all ground triples from Bvalid, that is, all of its triples except (?s, r, ?s);
the triple (s, r, s) for a fresh IRI s.
Consider also the mapping μ such that
μ(?u) = u;
\(\mu (?x_{i}) = c_{x_{i}}\), for each xi in \(\bar {x}\).
Clearly, μ is a pp-solution to both P and P′ over G. However, by the construction of G, the mapping μ′ = μ ∪ {?s ↦ s} is a pp-solution to P′ over G as well. Thus, μ is not a solution to P′ over G, and, since P ⊆ P′, we also have μ ∉ [[P]]G. Consequently, it must be possible to further extend μ′ to a mapping μ″ that is both in [[P]]G and in [[P′]]G, and is defined on all variables in Bψ. Essentially, this means that there is a homomorphism from Bψ to Bvalid that preserves ?s. Consider now the extension α′ of α to the variables \(\bar y\) such that μ″(?yj) = cy⊤ if α′(yj) is true and μ″(?yj) = cy⊥ otherwise. By the construction of Bvalid, this assignment validates all clauses in ψ, which, however, contradicts the assumption that ϕ is false.
Thus, we have shown that ϕ is true if and only if P ≡ P′. □
The existence of a \({{\Pi }_2^p}\) algorithm for containment immediately follows from Lemma 5: to show that P ⊈ P′, for \(P,P^{\prime }\in \mathcal {P}_{\text {wwd}}\), we just need to guess, in NP, a graph G of linear size as well as a mapping μ, check that μ ∉ [[P′]]G, and then call for a coNP oracle for checking that μ ∈ [[P]]G. The claim for patterns in \(\mathcal {U}_{\text {wwd}}\) is similar, but involves guessing a disjunct P1 of P with μ ∈ [[P1]]G and checking \(\mu \notin [{\kern -2.3pt}[ {P^{\prime }_{1}} ]{\kern -2.3pt}]_{G} \) for every disjunct \(P^{\prime }_{1}\) of P′. Since P ≡ P′ if and only if containment holds in both directions, the problem \(\textsc {Equivalence}({\mathcal {U}_{\text {wwd}}})\) is also in \({{\Pi }_{2}^{p}}\).
Hardness follows by the results in [35] for containment and by Lemma 6 for equivalence. □
Hence, for UNION- and FILTER-free patterns, the step from well-designed to weakly well-designed OPT incurs a complexity jump for containment and equivalence. However, for the fragments with UNION or projection complexity remains the same in both cases. As far as we are aware, these are the first decidability results on query equivalence and related problems for SPARQL fragments with OPT and FILTER.
8 Analysis of DBpedia Logs
In this section, we present an analysis of query logs over DBpedia, which suggests that the step from wd-patterns to wwd-patterns makes a dramatic difference in real life: while only about half of the queries with OPT have well-designed patterns, almost all of these patterns fall into the weakly well-designed fragment.
DBpedia [26] is a project providing access to RDF data extracted from Wikipedia via a SPARQL endpoint. DBpedia query logs are well suited for analysing the structure of real-life SPARQL queries as they contain a large amount of general-purpose knowledge base queries, generated both manually and automatically. DBpedia query logs have been analysed by Picalausa and Vansummeren [34], who reported that, over a period in 2010, about 46.38% of a total of 1344K distinct DBpedia queries used OPT. However, only 47.80% of the queries with OPT had well-designed patterns. Another analysis of DBpedia logs from the USEWOD 2011 data set performed by Arias Gallego et al. [7] concluded that 16.61% of about 5166K queries contained OPT; however, detailed structure of queries was not analysed.
We considered query logs over DBpedia 3.9 from USEWOD 2015 [30] and USEWOD 2016 [29]. The USEWOD 2015 DBpedia dataset is a random selection of almost 14M queries from the first half of 2014 while the USEWOD 2016 dataset contains 35M queries from the second half of 2015. We removed syntactically incorrect queries as well as queries outside of \(\mathcal {S}\) (in particular, queries using operators specific to SPARQL 1.1). Also, we rewrote the patterns of the remaining queries to unions of UNION-free patterns as proposed in [33] and eliminated duplicates, which left us with 6.6M queries in USEWOD 2015 and 9.1M queries in USEWOD 2016. Finally, we isolated queries involving OPT and counted how many of their patterns were in \(\mathcal {U}_{\text {wd}}\) and in \(\mathcal {U}_{\text {wwd}}\).
The results are given in Table 1. They confirm that a non-negligible number of DBpedia queries use OPT (over 17%). However, by far not all queries with OPT are well-designed (only about 44% for USEWOD 2015 and 52% for USEWOD 2016), which is consistent with the results in [34]. On the other hand, almost all of the patterns with OPT (over 99.9% in both cases) are weakly well-designed, which we consider as the main practical justification for wwd-patterns.
Table 1
Structure of query patterns in DBpedia logs from USEWOD 2015 and 2016
USEWOD 2015
USEWOD 2016
Unique patterns
Fraction of total
Fraction of patterns with OPT
Unique patterns
Fraction of total
Fraction of patterns with OPT
Total
6 606 201
100%
9 119 492
100%
Patterns with OPT
1 147 704
17.37%
100%
1 582 698
17.36%
100%
Unions of wd-patterns
500 676
7.58%
43.62%
816 276
8.95%
51.57%
Unions of wwd-patterns
1 147 135
17.36%
99.95%
1 582 339
17.35%
99.98%
What about the remaining 0.05% of queries with OPT? We looked at a number of such queries and identified what we believe to be the three most common sources of non-weakly-well-designedness in query patterns. The first and seemingly most common such source is joins between an OPT subpattern and another pattern on a variable that only occurs in the right argument of the OPT subpattern. The following query is an example of such a join:
We believe that the vast majority if not all such queries are erroneous as they are highly unlikely to yield meaningful answers in case the optional part fails to match. Intuitively, one would expect an answer to query (15) to contain the label of an object in variable ?ℓ together with one of its supertypes in variable ?u. And indeed, this is the answer returned by the query on graph G in Fig. 7a (see Fig. 7b). However, if an object in ?s is not assigned any type, which is explicitly allowed by the use of OPT, the query does not just return its label in ?ℓ leaving ?u unbound, as one would expect; instead, it returns the cross product of the label and all types in the graph, which are, of course, completely unrelated to the object in ?s (see Fig. 7c and d).
×
A second source of non-weakly-well-designed patterns is joins between an OPT subpattern and another pattern where the left argument of the OPT subpattern is empty. The following query illustrates this source:
Intuitively, query (16) computes the join between the answers to the pattern (?b,height, ?h) and those to (?b,city, ?c), provided the second set is non-empty; otherwise, the query just returns the answers to (?b,height, ?h). Hence, query (16) is equivalent to the following query in \(\mathcal {S}_{\text {wwd}}\):
The pattern is quite intuitive and it is easy to imagine similar patterns being useful in various applications. However, the normalisation algorithm in [33], which “pushes” unions outside, converts (17) to a pattern that is inherently non-well-designed (due to the two occurrences of ?a in the third disjunct):
We believe that this behaviour is unavoidable in general as we expect query answering to become \({{\Pi }_2^p}\)-hard over patterns that contain UNION in the right argument of OPT. Yet, for certain classes of patterns, generalising (17), one may be able to obtain a coNP evaluation algorithm by accounting for UNION natively rather than relying on the normalisation in [33]. A detailed study of such patterns, however, is outside the scope of the present paper.
9 Conclusion and Future Work
In this paper, we introduced a new fragment of SPARQL patterns called weakly well-designed patterns. This fragment extends the widely studied well-designed fragment by allowing variables from the optional side of an OPT-subpattern that are not “guarded” by the mandatory side to occur in certain positions outside of the subpattern. We showed that queries with wwd-patterns enjoy the same low complexity of evaluation as well-designed queries but cover almost all real-life queries. Moreover, our fragment is the maximal coNP fragment that does not impose structural restrictions on basic patterns and filter conditions. We studied the expressive power of the fragment and the complexity of its query optimisation problems.
For future work, we want to extend wwd-patterns to allow for non-top-level occurrences of UNION and projection. As we have seen in the previous section, this promises to be a challenging task since a naive extension of our definitions to such constructs is likely to increase reasoning complexity. Also, we want to take into account features of SPARQL 1.1 [17] such as GRAPH, NOTEXISTS and property paths. Finally, we would like to implement our ideas in a prototype system and compare its performance with existing SPARQL engines.
Acknowledgements
This work was supported by the EPSRC projects Score!, DBOnto, and ED3.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.