Patterns and Types for Querying XML Documents

Abstract

Among various proposals for primitives for deconstructing XML data two approaches seem to clearly stem from practice: path expressions, widely adopted by the database community, and regular expression patterns, mainly developed and studied in the programming language community. We think that the two approaches are complementary and should be both integrated in languages for XML, and we see in that an opportunity of collaboration between the two communities. With this aim, we give a presentation of regular expression patterns and the type systems they are tightly coupled with. Although this article advocates a construction promoted by the programming language community, we will try to stress some characteristics that the database community, we hope, may find interesting.

Giuseppe Castagna

Dual Syntax for XML Languages

Abstract

XML is successful as a machine processable data interchange format, but it is often too verbose for human use. For this reason, many XML languages permit an alternative more legible non-XML syntax. XSLT stylesheets are often used to convert from the XML syntax to the alternative syntax; however, such transformations are not reversible since no general tool exists to automatically parse the alternative syntax back into XML.

We present XSugar, which makes it possible to manage dual syntax for XML languages. An XSugar specification is built around a context-free grammar that unifies the two syntaxes of a language. Given such a specification, the XSugar tool can translate from alternative syntax to XML and vice versa. Moreover, the tool statically checks that the transformations are reversible and that all XML documents generated from the alternative syntax are valid according to a given XML schema.

Claus Brabrand, Anders Møller, Michael I. Schwartzbach

Exploiting Schemas in Data Synchronization

Abstract

Increased reliance on optimistic data replication has led to burgeoning interest in tools and frameworks for synchronizing disconnected updates to replicated data. We have implemented a generic synchronization framework, called Harmony, that can be used to build statebased synchronizers for a wide variety of tree-structured data formats. A novel feature of this framework is that the synchronization process—in particular, the recognition of conflicts—is driven by the schema of the structures being synchronized. We formalize Harmony’s synchronization algorithm, state a simple and intuitive specification, and illustrate how it can be used to synchronize trees representing a variety of specific forms of application data, including sets, records, and tuples.

J. Nathan Foster, Michael B. Greenwald, Christian Kirkegaard, Benjamin C. Pierce, Alan Schmitt

Efficiently Enumerating Results of Keyword Search

Abstract

Various approaches for keyword search in different settings (e.g., relational databases, XML and the Web) actually deal with the problem of enumerating K-fragments. For a given set of keywords K, a K-fragment is a subtree T of the given data graph, such that T contains all the keywords of K and no proper subtree of T has this property. There are three types of K-fragments: rooted, undirected and strong. This paper describes the first provably efficient algorithms for enumerating K-fragments. Specifically, for all three types of K-fragments, algorithms are given for enumerating all K-fragments with polynomial delay. For rooted K-fragments and acyclic data graphs, an algorithm is given for enumerating with polynomial delay in the order of increasing weight (i.e., the ranked order), assuming that K is of a fixed size. Finally, an efficient algorithm is described for enumerating K-fragments in a heuristically ranked order.

Benny Kimelfeld, Yehoshua Sagiv

Mapping Maintenance in XML P2P Databases

Abstract

Unstructured p2p database systems are usually characterized by the presence of schema mappings among peers. In these systems, the detection of corrupted mappings is a key problem. A corrupted mapping fails in matching the target or the source schema, hence it is not able to transform data conforming to a schema \(\mathcal{S}_i\) into data conforming to a schema \(\mathcal{S}_j\), nor it can be used for effective query reformulation.

This paper describes a novel technique for maintaining mappings in XML p2p databases, based on a semantic notion of mapping correctness.

Dario Colazzo, Carlo Sartiani

Inconsistency Tolerance in P2P Data Integration: An Epistemic Logic Approach

Abstract

We study peer-to-peer data integration, where each peer models an autonomous system that exports data in terms of its own schema, and data interoperation is achieved by means of mappings among the peer schemas, rather than through a global schema. We propose a multi-modal epistemic semantics based on the idea that each peer is conceived as a rational agent that exchanges knowledge/belief with other peers, thus nicely modeling the modular structure of the system. We then address the issue of dealing with possible inconsistencies, and distinguish between two types of inconsistencies, called local and P2P, respectively. We define a nonmonotonic extension of our logic that is able to reason on the beliefs of peers under inconsistency tolerance. Tolerance to local inconsistency essentially means that the presence of inconsistency within one peer does not affect the consistency of the whole system. Tolerance to P2P inconsistency means being able to resolve inconsistencies arising from the interaction between peers. We study query answering and its data complexity in this setting, and we present an algorithm that is sound and complete with respect to the proposed semantics, and optimal with respect to worst-case complexity.

Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, Riccardo Rosati

XML Data Integration with Identification

Abstract

Data integration is the problem of combining data residing at different sources, and providing the user with a virtual view, called global schema, which is independent from the model and the physical origin of the sources. Whereas many data integration systems and theoretical works have been proposed for relational data, not much investigation has been focused yet on XML data integration. Our goal is therefore to address some of its related issues. In particular, we highlight two major issues that emerge in the XML context: (i) the global schema may be characterized by a set of constraints, expressed by means of a DTD and XML integrity constraints, (ii) the concept of node identity requires to introduce semantic criteria to identify nodes coming from different sources. We propose a formal framework for XML data integration systems based on an expressive XML global schema, a set of XML data sources and a set of mappings specified by means of a simple tree language. Then, we define an identification function that aims at globally identifying nodes coming from different sources. Finally, we propose algorithms to answer queries under different assumptions for the mappings.

Antonella Poggi, Serge Abiteboul

Satisfiability of XPath Queries with Sibling Axes

Abstract

We study the satisfiability problem for XPath fragments supporting the following-sibling and preceding-sibling axes. Although this problem was recently studied for XPath fragments without sibling axes, little is known about the impact of the sibling axes on the satisfiability analysis. To this end we revisit the satisfiability problem for a variety of XPath fragments with sibling axes, in the presence of DTDs, in the absence of DTDs, and under various restricted DTDs. In these settings we establish complexity bounds ranging from NLOGSPACE to undecidable. Our main conclusion is that in many cases, the presence of sibling axes complicates the satisfiability analysis. Indeed, we show that there are XPath satisfiability problems that are in PTIME and PSPACE in the absence of sibling axes, but that become NP-hard and EXPTIME-hard, respectively, when sibling axes are used instead of the corresponding vertical modalities (e.g., the wildcard and the descendant axis).

Floris Geerts, Wenfei Fan

XML Subtree Queries: Specification and Composition

Abstract

A frequent task encountered in XML processing is to filter an input document to produce a subdocument; that is, a document whose root-to-leaf paths are root-to-leaf paths of the original document and which inherits the tree structure of the original document. These are what we mean by subtree queries, and while they are similar to XPath filters, they cannot be naturally specified either in XPath or in XQuery. Special-purpose subtree query languages provide a natural idiom for specifying this class of queries, but both composition and evaluation are problematic. In this paper we show that for natural fragments of XPath, the resulting subtree query languages are closed under composition. This closure property allows a sequence of subtree queries to be rewritten as a single subtree query, which can then be evaluated either by a subtree-query specific evaluator or via translation to XQuery. We provide a set of composition algorithms for each common XPath fragment and discuss their complexity.

Michael Benedikt, Irini Fundulaki

On the Expressive Power of XQuery Fragments

Abstract

XQuery is known to be a powerful XML query language with many bells and whistles. For many common queries we do not need all the expressive power of XQuery. We investigate the effect of omitting certain features of XQuery on the expressive power of the language. We start from a simple base fragment which can be extended by several optional features being aggregation functions such as count and sum, sequence generation, node construction, position information in for loops, and recursion. In this way we obtain 64 different XQuery fragments which can be divided into 17 different equivalence classes such that two fragments can express the same functions iff they are in the same equivalence class. Moreover, we investigate the relationships between these equivalence classes.

Jan Hidders, Stefania Marrara, Jan Paredaens, Roel Vercammen

A Type Safe DOM API

Abstract

DOM (Document Object Model) is the W3C recommendation for an API for manipulating XML documents. The API is stated in terms of a language neutral IDL with normative bindings given for Java and ECMAScript. The type system underlying the DOM API is a simple object-based one which relies entirely on interfaces and interface extension. However, this simplicity is deceiving because the DOM architects implicitly impose a large number of constraints which are only stated informally. The long list of exceptions that some DOM methods may raise witnesses these constraints.

The present work defines a refinement of Java’s type system which makes most of these constraints accessible and thus checkable to the compiler. Technically, we graft a polymorphic annotation system on top of the host language’s types and propagate the annotations using ideas borrowed from affine type systems. We provide a type soundness proof with respect to an operational semantics of a Java core language.

Peter Thiemann

Type-Based Optimization for Regular Patterns

Abstract

Pattern matching mechanisms based on regular expressions feature in a number of recent languages for processing XML. The flexibility of these mechanisms demands novel approaches to the familiar problems of pattern-match compilation—how to minimize the number of tests performed during pattern matching while keeping the size of the output code small.

We describe semantic compilation methods in which we use the schema of the value flowing into a pattern matching expression to generate efficient target code. We start by discussing a pragmatic algorithm used currently in the compiler of Xtatic and report some preliminary performance results. For a more fundamental analysis, we define an optimality criterion of “no useless tests” and show that it is not satisfied by Xtatic’s algorithm. We constructively demonstrate that the problem of generating optimal pattern matching code is decidable for finite (non-recursive) patterns.

Michael Y. Levin, Benjamin C. Pierce

Efficient Memory Representation of XML Documents

Abstract

Implementations that load XML documents and give access to them via, e.g., the DOM, suffer from huge memory demands: the space needed to load an XML document is usually many times larger than the size of the document. A considerable amount of memory is needed to store the tree structure of the XML document. Here a technique is presented that allows to represent the tree structure of an XML document in an efficient way. The representation exploits the high regularity in XML documents by “compressing” their tree structure; the latter means to detect and remove repetitions of tree patterns. The functionality of basic tree operations, like traversal along edges, is preserved in the compressed representation. This allows to directly execute queries (and in particular, bulk operations) without prior decompression. For certain tasks like validation against an XML type or checking equality of documents, the representation allows for provably more efficient algorithms than those running on conventional representations.

Giorgio Busatto, Markus Lohrey, Sebastian Maneth

N-Ary Queries by Tree Automata

Abstract

We investigate n-ary node selection queries in trees by successful runs of tree automata. We show that run-based n-ary queries capture MSO, contribute algorithms for enumerating answers of n-ary queries, and study the complexity of the problem. We investigate the subclass of run-based n-ary queries by unambiguous tree automata.

Joachim Niehren, Laurent Planque, Jean-Marc Talbot, Sophie Tison

Minimizing Tree Automata for Unranked Trees

Abstract

Automata for unranked trees form a foundation for XML schemas, querying and pattern languages. We study the problem of efficiently minimizing such automata. We start with the unranked tree automata (UTAs) that are standard in database theory, assuming bottom-up determinism and that horizontal recursion is represented by deterministic finite automata. We show that minimal UTAs in that class are not unique and that minimization is np-hard. We then study more recent automata classes that do allow for polynomial time minimization. Among those, we show that bottom-up deterministic stepwise tree automata yield the most succinct representations.

Wim Martens, Joachim Niehren

Dependency-Preserving Normalization of Relational and XML Data

Abstract

Having a database design that avoids redundant information and update anomalies is the main goal of normalization techniques. Ideally, data as well as constraints should be preserved. However, this is not always achievable: while BCNF eliminates all redundancies, it may not preserve constraints, and 3NF, which achieves dependency preservation, may not always eliminate all redundancies.

Our first goal is to investigate how much redundancy 3NF tolerates in order to achieve dependency preservation. We apply an information-theoretic measure and show that only prime attributes admit redundant information in 3NF, but their information content may be arbitrarily low.

Then we study the possibility of achieving both redundancy elimination and dependency preservation by a hierarchical representation of relational data in XML. We provide a characterization of cases when an XML normal form called XNF guarantees both.

Finally, we deal with dependency preservation in XML and show that like in the relational case, normalizing XML documents to achieve non-redundant data can result in losing constraints. By modifying the definition of XNF, we define another normal form for XML documents, X3NF, that generalizes 3NF for the case of XML and achieves dependency preservation.

Solmaz Kolahi

Complexity and Approximation of Fixing Numerical Attributes in Databases Under Integrity Constraints

Abstract

Consistent query answering is the problem of computing the answers from a database that are consistent with respect to certain integrity constraints that the database as a whole may fail to satisfy. Those answers are characterized as those that are invariant under minimal forms of restoring the consistency of the database. In this context, we study the problem of repairing databases by fixing integer numerical values at the attribute level with respect to denial and aggregation constraints. We introduce a quantitative definition of database fix, and investigate the complexity of several decision and optimization problems, including DFP, i.e. the existence of fixes within a given distance from the original instance, and CQA, i.e. deciding consistency of answers to aggregate conjunctive queries under different semantics. We provide sharp complexity bounds, identify relevant tractable cases; and introduce approximation algorithms for some of those that are intractable. More specifically, we obtain results like undecidability of existence of fixes for aggregation constraints; MAXSNP-hardness of DFP, but a good approximation algorithm for a relevant special case; and intractability but good approximation for CQA for aggregate queries for one database atom denials (plus built-ins).

Leopoldo Bertossi, Loreto Bravo, Enrico Franconi, Andrei Lopatenko

Consistent Query Answers on Numerical Databases Under Aggregate Constraints

Abstract

The problem of extracting consistent information from relational databases violating integrity constraints on numerical data is addressed. In particular, aggregate constraints defined as linear inequalities on aggregate-sum queries on input data are considered. The notion of repair as consistent set of updates at attribute-value level is exploited, and the characterization of several data-complexity issues related to repairing data and computing consistent query answers is provided.

Sergio Flesca, Filippo Furfaro, Francesco Parisi

Springer Professional

Database Programming Languages

10th International Workshop, DBPL 2005, Trondheim, Norway, August 28-29, 2005, Revised Selected Papers

Table of Contents

Frontmatter

Patterns and Types for Querying XML Documents

Dual Syntax for XML Languages

Exploiting Schemas in Data Synchronization

Efficiently Enumerating Results of Keyword Search

Mapping Maintenance in XML P2P Databases

Inconsistency Tolerance in P2P Data Integration: An Epistemic Logic Approach

XML Data Integration with Identification

Satisfiability of XPath Queries with Sibling Axes

XML Subtree Queries: Specification and Composition

On the Expressive Power of XQuery Fragments

A Type Safe DOM API

Type-Based Optimization for Regular Patterns

Efficient Memory Representation of XML Documents

N-Ary Queries by Tree Automata

Minimizing Tree Automata for Unranked Trees

Dependency-Preserving Normalization of Relational and XML Data

Complexity and Approximation of Fixing Numerical Attributes in Databases Under Integrity Constraints

Consistent Query Answers on Numerical Databases Under Aggregate Constraints

Backmatter

Premium Partner