Elsevier

Information Systems

Volume 28, Issue 8, December 2003, Pages 1037-1063
Information Systems

Reasoning about keys for XML

https://doi.org/10.1016/S0306-4379(03)00028-0Get rights and content

Abstract

We study absolute and relative keys for XML, and investigate their associated decision problems. We argue that these keys are important to many forms of hierarchically structured data including XML documents. In contrast to other proposals of keys for XML, we show that these keys are always (finitely) satisfiable, and their (finite) implication problem is finitely axiomatizable. Furthermore, we provide a polynomial time algorithm for determining (finite) implication in the size of keys. Our results also demonstrate, among other things, that the analysis of XML keys is far more intricate than its relational counterpart.

Introduction

Keys are of fundamental importance in databases. They provide a means for locating a specific object within the database and for referencing one object from another. Keys also constitute an important class of constraints on the validity of data. In particular, keys—as opposed to addresses or internal object identifiers—provide an invariant connection between an object in the real world and its representation in the database. This connection is crucial in modifying the database as the world that it models changes.

As XML is increasingly used in data management, it is natural to require a value-based method of locating an element in an XML document. Key specifications for XML have been proposed in the XML standard [1], XML Data [2], and XML Schema [3]. However, existing proposals cannot handle one or more of the following situations. First, as in databases, one may want to define keys with compound structure. For example, the name subelement of a person element could be a natural key, but may itself have first-name and last-name subelements. Keys should not be constrained to be character strings (attribute values). Second, in hierarchically structured data, one may want to identify elements within the scope of a sub-document. For example, the number subelement of a chapter element may be a key for chapters of a specific book, but would not be unique among chapters of different books. We will call a key such as number a relative key of chapter since it is a key in the context of a given book element, i.e., it is to hold on the sub-document rooted at the book element. If the context element is the root of the document, the key is to hold on the entire document, and is referred to as an absolute key. The idea of keys having a scope is not new. In relational databases, scoped keys exist in the form of weak entities. Using the same example, chapter is a weak entity of book. A chapter number would only make sense in the context of a certain book. Third, since XML is not required to conform to a DTD or schema definition, it is useful to have a definition of keys that is independent of any specification (such as a DTD or XML Schema) of the type of an XML document.

To overcome these limitations, the authors have recently [4] provided a definition of keys for XML which appears to be applicable—among other things—to a wide variety of hierarchical scientific data sets. However, reasoning about keys, which is trivial in a relational setting, becomes significantly more complicated in a hierarchical setting. In [5] we described a set of rules for key implication and outlined soundness and completeness results. This paper is a full version of that conference presentation and includes motivation and full proofs. The authors understand [6] that these definitions and results have significantly influenced the design of keys in the current version of XML Schema [3].

In developing this notion of keys for XML, we start with a tree model of data as used in DOM [7], XSL [8], [9], XQL [10] and XML Schema [3]. An example of this representation is shown in Fig. 1. Nodes are annotated by their type: E for element, A for attribute, and S for string (or PCDATA). Some keys for these data might include: (1) A book node is identified by @isbn; (2) An author node is identified by name, no matter where the author node appears; and (3) Within any subtree rooted at book, a chapter node is identified by @number. These keys are defined independently of any type specification. The first two are examples of absolute keys since they must hold globally throughout the tree. Observe that name has a complex structure. As a consequence, checking whether two authors violate this constraint involves testing value-equality on the subtrees rooted at their name nodes. The last one is an example of a relative key since it holds locally within each subtree rooted at a book. It should be noted that a chapter @number is not a key for the set of all chapter nodes in the document since two different books have chapters with @number=1. We remark that proposals prior to [4] were not capable of expressing the second and third constraints.

The notion of relative keys is particularly natural for hierarchically structured data, and is motivated in part by our experience with scientific data formats. Many scientific databases represent and transmit their data in one of a variety of data formats. Some of these data formats are general purpose, e.g. ASN.1, which is used in GenBank [11] and AceDB [12]; Others, such as EMBL, which is used in SwissProt [13], are domain-specific. All of these specifications have a hierarchical structure. As a typical example, SwissProt [14] at the top level consists of a large set of entries, each of which is identified by an accession number. Within each entry there is a sequence of citations, each of which is identified by a number 1,2,3... within the entry. Thus, to identify a citation fully, we need to provide both an accession number for the entry and the number of the citation within the entry. Note that the same number for a citation (e.g. 3) may occur within many different entries, thus the citation number is a relative key within each entry.

All the non-XML data formats mentioned above have an easy conversion to XML. We have also found that the data sets in these formats have a natural key structure. However, it is not the case that it is easy to find a natural non-trivial DTD or XML Schema description for the converted data. Finding an appropriate DTD for a given data format may be problematic, and even if one exists, the conversion to XML conforming to that DTD may be a more complex task. Indeed, in many applications XML data does not come with a DTD or schema. This observation supports our claim that, in some applications, keys should be treated independently of any other type constraints. It is worth mentioning that there has been recent work [15], [16] on the analyses of XML keys in the presence of DTDs.

One of the most interesting questions involving keys is that of logical implication, i.e., deciding if a new key holds, given a set of existing keys. This is important for minimizing the cost of checking that a document satisfies a set of key constraints, and may also provide the basis for reasoning about how constraints can be propagated through view definitions. Thus, a central task for the study of XML keys is to develop an algorithm for determining logical implication. It is also desirable to develop a sound and complete set of inference rules for generating symbolic proofs of logical implication. The existence of a finite set of such inference rules, referred to as finite axiomatizability, is a stronger property than the existence of an algorithm, because the former implies the latter but not the other way around [17]. Another interesting question is whether a set of keys is “reasonable”, that is, if there exists some (finite) document that satisfies the key specification. This is referred to as the (finite) satisfiability problem.

In relational databases, the finite satisfiability problem is trivial: Given any finite set of keys over a relational schema, one can always find a finite instance of the schema that satisfies that set. The (finite) implication problem for keys and, more generally, functional dependencies is also straightforward. It is well-known [17], [18] that the problem is decidable in linear time, and that there are exactly two inference rules that are sound and complete for the implication analysis. Let R be a relation schema and Att(R) denote the set of attributes of R. We use XR to denote that X is a key of R, where XAtt(R). Then the rules can be given as:Att(R)→R(schema),X→RX⊆YY→R(superkey).

The first rule says that for any relation schema R, the set of all the attributes of R is a key of R. The second asserts that if X is a key of R, then so is any superset of X.

For XML the story is more complicated since the hierarchical structure of data is far more complex than the 1NF structure of relational data. In some proposals, keys are not even finitely satisfiable. In XML Schema for example, it is possible to specify a key, in simplified syntax, (//∗,[id]), where “//∗”, in XPath [19] syntax, traverses to any descendant of the root of an XML document tree. This key asserts that any node in an XML tree must have a unique id subelement (of text value) and its id uniquely identifies the node in the entire document. However, no finite XML tree satisfies this key because any id node must have an id itself, and this yields an infinite chain of id nodes. For implication of XML keys, the analysis is even more intriguing. Keys in XML Schema are defined in terms of XPath [19], which is a limited yet complicated language. Recently, it has been shown [20], [21] that it is rather expensive to determine containment of XPath expressions, which is important in the implication analysis of XML keys. Indeed, the containment problem is undecidable in the presence of disjunction, DTDs, and variables [21], and it is coNP-complete even for a small fragment of XPath in the absence of DTDs [20]. For the (finite) axiomatizability of equivalence of XPath expressions, which is important in studying the (finite) axiomatizability of XML key implication, the analysis is even more intriguing [22]. Thus, not surprisingly, reasoning about keys defined in XML Schema is prohibitively expensive: Even for unary keys, i.e., keys defined in terms of a single subelement, the finite satisfiability problem is NP-hard and the implication problem is coNP-hard [23]. For the entire class of keys of XML Schema, to the best of our knowledge, both the implication and axiomatizability problems are still open. The reason that the implication problem in XML Schema is different from that addressed in this paper is twofold. First, there are bad interactions between DTDs and keys, as observed in [15], [16]. Second, XML Schema imposes the additional constraint on keys that they exist and be unique. These are the “strong keys” described in [4].

In contrast, we show in this paper that the basic keys of [4] can be reasoned about efficiently. More specifically, we show that they are finitely satisfiable and their implication is decidable in PTIME. Better still, their (finite) implication is finitely axiomatizable, i.e., there is a finite set of inference rules that is sound and complete for implication of these keys. In developing these results, we also investigate value-equality on XML subtrees and containment of path expressions, which are not only interesting in their own right but also important in the study of decision problems for XML keys.

Despite the importance of reasoning about keys for XML, little previous work has investigated this issue. The only work closely related to this paper is [15], [16], [24]. For a class of keys and foreign keys, the decision problems were studied in the absence [24] and presence [15], [16] of DTDs. The keys considered there are defined in terms of XML attributes and are not as expressive as keys studied in this paper. Integrity constraints defined in terms of navigation paths have been studied for semistructured [25] and XML data in [26], [27], [28]. These constraints are generalizations of inclusion dependencies commonly found in relational databases, and are not capable of expressing keys. Generalizations of functional dependencies (FDs) have also been studied. In the context of object-oriented data models, these generalizations include a definition of FDs in terms of navigation paths [29], [30]; in the context of nested relational models, FDs have been defined in terms of equality on set types [31] and within the scope of nested sets [32]. However, these constraints were investigated in database settings, which are quite different from the tree model for XML data considered in this paper. Surveys on XML constraints can be found in [33], [34].

The remainder of the paper is organized as follows. Section 2 formally defines XML trees, value equality, and (absolute and relative) keys for XML. Section 3 establishes the finite axiomatizability and complexity results: First, we give a quadratic time algorithm for determining inclusion of path expressions. The ability to determine inclusion of path expressions is then used in developing inference rules for keys, for which a PTIME algorithm is given. Finally, Section 4 identifies directions for further research.

Section snippets

Keys

Our notion of keys is based on a tree model of XML data, as illustrated in Fig. 1. Although the model is quite simple, we need to do two things prior to defining keys: The first is to give a precise definition of value equality for XML keys; the second is to describe a path language that will be used to locate sets of nodes in an XML document. We therefore introduce a class of regular path expressions, and define keys in terms of this path language.

Key implication

In this section, we study the finite implication problem for keys. Our main result is the following:

Theorem 2

The finite implication problem for K is finitely axiomatizable and decidable in polynomial time in the size of keys.

We prove this theorem by showing that there is a finite axiomatization (see Lemma 6) and an algorithm for determining finite implication of K constraints (see Lemma 7). A roadmap for the proof of this theorem is as follows. Since our axioms for finite implication for K rely on path

Discussion

We have investigated a key constraint language introduced in [4] for XML data and studied the associated (finite) satisfiability and (finite) implication problems in the absence of DTDs. These keys are capable of expressing many important properties of XML data. Moreover, in contrast to other proposals, keys defined in this language can be reasoned about efficiently. More specifically, keys expressed in this language are always finitely satisfiable, and their (finite) implication is finitely

Acknowledgements

Peter Buneman and Wang-Chiew Tan are supported by NSF IIS 99-77408 and NSF DL-2 IIS 98-17444. Susan Davidson is supported by NSF DBI99-75206. Wenfei Fan is supported in part by NSF Career Award IIS-0093168. Part of this work was done while Wang-Chiew Tan was a Ph.D. student at the University of Pennsylvania. The authors thank Michael Benedikt, Chris Brew, Dave Maier, and Henry Thompson for helpful discussions. The authors are grateful to the referees for valuable suggestions on improving the

References (44)

  • J. Clark, XSL Transformations (XSLT), W3C Recommendation, 1999,...
  • P. Wadler, A formal semantics for patterns in XSL, Technical Report, Computing Sciences Research Center, Bell Labs,...
  • J. Robie, J. Lapp, D. Schach, XML Query Language (XQL), Workshop on XML Query Languages, Boston, MA,...
  • D. Benson

    GenBank

    Nucleic Acids Res.

    (2000)
  • J. Sulston

    The C. elegans genome sequencing projecta beginning

    Nature

    (1992)
  • W. Baker

    The EMBL nucleotide sequence database

    Nucleic Acids Res.

    (2000)
  • A. Bairoch et al.

    The SWISS-PROT protein sequence database and its supplement TrEMBL

    Nucleic Acids Res.

    (2000)
  • M. Arenas, W. Fan, L. Libkin, On verifying consistency of XML specifications, in: Proceedings of ACM Symposium on...
  • W. Fan et al.

    On XML integrity constraints in the presence of DTDs

    J. Assoc. Comput. Mach.

    (2002)
  • S. Abiteboul et al.

    Foundations of Databases

    (1995)
  • R. Ramakrishnan et al.

    Database Management Systems

    (2000)
  • J. Clark, S. DeRose, XML Path Language (XPath), W3C Working Draft, 1999,...
  • Cited by (90)

    • An efficient similarity-based approach for comparing XML documents

      2018, Information Systems
      Citation Excerpt :

      The existence of an ID attribute for a given node provides a unique condition to match nodes: the nodes in both revisions must have the same ID value to be matched. In fact, the use of primary keys in XML has received attention in the past [11,12], and nowadays W3C provides a way to express them in XML Schema [13]. However, schemas are not mandatory and most XML documents do not have them [14–16].

    • A hybrid logic for XML reference constraints

      2018, Data and Knowledge Engineering
      Citation Excerpt :

      Unfortunately, as observed in [8,11,12], XML Schema is too complicate and not compact at all in the specification of even simple integrity constraints. Focusing on the research area related to XML modeling, in the literature there have been research efforts dealing with the proposal of formal approaches to specify different kinds of integrity constraints, trying to maintain the compactness of DTD specification [8,11,13–19]. Most efforts focused on formalisms allowing the expression of some kinds of key and foreign key constraints, by suitably extending DTD syntax and semantics [11,16].

    • XML data exchange with target constraints

      2013, Information Processing and Management
    • When conceptual model meets grammar: A dual approach to XML data modeling

      2012, Data and Knowledge Engineering
      Citation Excerpt :

      Apart from that, XML Schema involves new constructs assert and report that correspond to Schematron rules and enable us to express more complex conditions. Basic foundations and classifications of XML keys and discussion of the related decision problems can be found in [19], satisfiability of specification of keys is studied in [10,39]. Other kinds of XML integrity constraints are studied in [40,41].

    View all citing articles on Scopus

    Recommended by Prof. Dr. Gottfried Vossen.

    View full text