
1997 | Book

Database Theory — ICDT '97

6th International Conference Delphi, Greece, January 8–10, 1997 Proceedings

Edited by: Foto Afrati, Phokion Kolaitis

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 6th International Conference on Database Theory, ICDT '97, held in Delphi, Greece, in January 1997.
The 29 revised full papers presented in the volume were carefully selected from a total of 118 submissions. Also included are invited papers by Serge Abiteboul and Jeff Ullman as well as a tutorial on data mining by Heikki Mannila. The papers are organized in sections on conjunctive queries in heterogeneous databases, logic and databases, active databases, new applications, concurrency control, unstructured data, object-oriented databases, access methods, and spatial and bulk data.

Table of contents

Frontmatter
Querying semi-structured data
Serge Abiteboul
Information integration using logical views

A number of ideas concerning information-integration tools can be thought of as constructing answers to queries using views that represent the capabilities of information sources. We review the formal basis of these techniques, which are closely related to containment algorithms for conjunctive queries and/or Datalog programs. Then we compare the approaches taken by AT&T Labs' “Information Manifold” and the Stanford “Tsimmis” project in these terms.

Jeffrey D. Ullman
Methods and problems in data mining

Knowledge discovery in databases and data mining aim at semiautomatic tools for the analysis of large data sets. We consider some methods used in data mining, concentrating on levelwise search for all frequently occurring patterns. We show how this technique can be used in various applications. We also discuss possibilities for compiling data mining queries into algorithms, and look at the use of sampling in data mining. We conclude by listing several open research problems in data mining and knowledge discovery.

Heikki Mannila
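
The levelwise search highlighted in this tutorial abstract is the Apriori idea: a pattern can only be frequent if all of its sub-patterns are frequent. A minimal illustrative sketch for frequent itemsets (the function name, data layout, and plain-set representation are our assumptions, not the tutorial's):

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Levelwise (Apriori-style) search: a k-itemset can only be frequent
    if all of its (k-1)-subsets are frequent."""
    # Level 1: frequent single items.
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) >= min_support}
    result = set(current)
    k = 2
    while current:
        # Candidate generation: unions of itemsets from the previous level.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # Count support against the data; survivors form the next level.
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) >= min_support}
        result |= current
        k += 1
    return result
```

The pruning step is what makes the search levelwise: support is only counted for candidates all of whose subsets already survived.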
Conjunctive query containment revisited
Extended abstract

We consider the problems of conjunctive query containment and minimization, which are known to be NP-complete, and show that these problems can be solved in polynomial time for the class of acyclic queries. We then generalize the notion of acyclicity and define a parameter called query width that captures the “degree of cyclicity” of a query: in particular, a query is acyclic if and only if its query width is 1. We give algorithms for containment and minimization that run in time polynomial in n^k, where n is the input size and k is the query width. These algorithms naturally generalize those for acyclic queries, and are of practical significance because many queries have small query width compared to their sizes. We show that we can obtain good bounds on the query width of Q using the treewidth of the incidence graph of Q. Finally, we apply our containment algorithm to the practically important problem of finding equivalent rewritings of a query using a set of materialized views.

Chandra Chekuri, Anand Rajaraman
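
The classical test this abstract builds on is the containment-mapping characterization: Q1 ⊆ Q2 iff there is a homomorphism from Q2 into Q1 that preserves the head. A brute-force illustrative sketch (the tuple encoding of queries is our own; the paper's polynomial-time algorithms for bounded query width are not reproduced here):

```python
from itertools import product

def is_contained(q_sub, q_super):
    """Decide q_sub ⊆ q_super for conjunctive queries by searching for a
    containment mapping from q_super's terms into q_sub's terms that sends
    the head of q_super to the head of q_sub and every body atom of q_super
    onto some body atom of q_sub. Exponential brute force, for illustration.
    A query is (head, body): head a tuple of variables, body a list of
    (predicate, args) atoms."""
    head1, body1 = q_sub
    head2, body2 = q_super
    vars2 = sorted({v for _, args in body2 for v in args} | set(head2))
    terms1 = sorted({v for _, args in body1 for v in args} | set(head1))
    atoms1 = {(p, tuple(args)) for p, args in body1}
    for image in product(terms1, repeat=len(vars2)):
        h = dict(zip(vars2, image))
        if tuple(h[v] for v in head2) != tuple(head1):
            continue  # mapping must preserve the head
        if all((p, tuple(h[v] for v in args)) in atoms1 for p, args in body2):
            return True
    return False
```

For example, Q1(x) :- e(x,y), e(y,z) is contained in Q2(x) :- e(x,y) (every two-step path starts with an edge), but not vice versa.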
Semantics and containment of queries with internal and external conjunctions

We study conjunctive queries that combine information from multiple sources. The need for combining information is manifest for instance in multimedia systems. It has recently been recognized that query semantics for these systems should be based on some quantitative model, such as fuzzy logic. Further complications arise, however, since the semantics used internally by subsystems, and the semantics used externally to combine information, are not necessarily the same. In this paper we give a solution based on general multivalued logics with lattice-based semantics. The internal and external semantics are tied to each other through the concept of a bilattice. Queries using both internal level and external level conjunctions have a natural semantics in bilattices. We then show that homomorphism techniques from core database theory carry over to query containment for internal/external conjunctive queries in the bilattice setting. We also show that the computational complexity of determining containment of internal/external conjunctive queries is in general Π₂ᵖ-complete, and NP-complete in some restricted cases.

Gösta Grahne, Nicolas Spyratos, Daniel Stamate
Efficient complete local tests for conjunctive query constraints with negation

We consider the problem of incrementally checking global integrity constraints without using all the relations under constraint. In many application areas such as collaborative design, mobile computing and enterprise information systems, total data availability cannot be assumed. Even if all base data is available, some of it may incur such a high cost that its use should only be considered as a last resort. Without looking at all the base data, how can one meaningfully check a constraint for violation? When the constraint is known to be satisfied prior to the update, the state of the relations that are available (aka local) can in principle be used to infer something about the relations that are not available (aka remote). This observation is the basis for the existence of tests that guarantee that data integrity is preserved under a given update, without looking at all the base data. In order to make integrity maintenance practical, the challenge is to find those tests that are most general (we call them Complete Local Tests, or CLTs in short) and that are efficient to generate and execute. This paper addresses the problem of finding efficient CLTs for an important class of constraints that are very common in practice: constraints expressible as conjunctive queries with negated subgoals (abbreviated CQC¬). We show that for single updates, all CQC¬ constraints admit a CLT that can be expressed in nonrecursive Datalog¬ when the predicates for the remote relations are not repeated in the constraint query. We then extend this result to a larger class of constraints and to certain sets of updates.

Nam Huyn
Selection of views to materialize in a data warehouse

A data warehouse stores materialized views of data from one or more sources, with the purpose of efficiently implementing decision-support or OLAP queries. One of the most important decisions in designing a data warehouse is the selection of materialized views to be maintained at the warehouse. The goal is to select an appropriate set of views that minimizes total query response time and the cost of maintaining the selected views, given a limited amount of resource, e.g., materialization time, storage space, etc. In this article, we develop a theoretical framework for the general problem of selection of views in a data warehouse. We present competitive polynomial-time heuristics for selection of views to optimize total query response time, for some important special cases of the general data warehouse scenario, viz.: (i) an AND view graph, where each query/view has a unique evaluation, and (ii) an OR view graph, in which any view can be computed from any one of its related views, e.g., data cubes. We extend the algorithms to the case when there is a set of indexes associated with each view. Finally, we extend our heuristic to the most general case of AND-OR view graphs.

Himanshu Gupta
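
Heuristics for this kind of selection problem are often greedy: repeatedly materialize the view with the best benefit (reduction in total query response time) per unit of consumed resource until the budget runs out. A toy sketch of that shape (the interface and cost model are illustrative assumptions, not the paper's algorithm):

```python
def greedy_view_selection(views, query_cost, budget):
    """Greedy heuristic: at each step, materialize the not-yet-selected view
    with the highest (cost reduction) / (space) ratio that still fits in the
    remaining space budget.
    `views`: {name: space}; `query_cost(selected)`: total response time given
    the set of materialized views."""
    selected, used = set(), 0
    while True:
        best, best_ratio = None, 0.0
        base = query_cost(selected)
        for v, space in views.items():
            if v in selected or used + space > budget:
                continue
            gain = base - query_cost(selected | {v})
            if space > 0 and gain / space > best_ratio:
                best, best_ratio = v, gain / space
        if best is None:
            return selected
        selected.add(best)
        used += views[best]
```

Under a toy OR-view-graph-style cost model where each query is answered from the cheapest available view, the heuristic picks whichever view shaves the most total response time per unit of space.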
Total and partial well-founded Datalog coincide

We show that the expressive power of well-founded Datalog does not decrease when restricted to total programs (it is known to decrease from Π¹₁ to Δ¹₁ on infinite Herbrand structures), thereby affirmatively answering an open question posed by Abiteboul, Hull, and Vianu [AHV95]. In particular, we show that for every well-founded Datalog program there exists an equivalent total program whose only recursive rule is of the form win(X̄) ← move(X̄, Ȳ), ¬win(Ȳ), where move is definable by a quantifier-free first-order formula. This yields a nice new normal form for well-founded Datalog and implies that it is sufficient to consider draw-free games in order to evaluate arbitrary Datalog programs under the well-founded semantics.

Jörg Flum, Max Kubierschky, Bertram Ludäscher
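
The normal-form rule win(X̄) ← move(X̄, Ȳ), ¬win(Ȳ) has a concrete reading on finite game graphs: under the well-founded semantics, a position is lost if every move leads to a won position (so a dead end is lost), won if some move leads to a lost position, and drawn (undefined) otherwise. A small illustrative evaluator (the data representation is our assumption):

```python
def solve_game(positions, move):
    """Evaluate win(X) ← move(X, Y), ¬win(Y) on a finite game graph.
    Iterate to a fixpoint: lost = all moves lead to won (dead ends are
    lost); won = some move leads to lost; the rest are drawn, i.e. the
    undefined positions of the well-founded model."""
    won, lost = set(), set()
    changed = True
    while changed:
        changed = False
        for p in positions:
            if p not in lost and all(q in won for q in move.get(p, [])):
                lost.add(p)
                changed = True
            if p not in won and any(q in lost for q in move.get(p, [])):
                won.add(p)
                changed = True
    drawn = set(positions) - won - lost
    return won, lost, drawn
```

On the chain a→b→c with an isolated cycle d↔e, c is lost (dead end), b is won, a is lost again, and d, e are drawn; a draw-free game is exactly one where the last set comes out empty.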
Fine hierarchies of generic computation

Suppose that you are a user of a commercial relational database, accessible over the Internet, whose owner has decided to copy the price lists of the US telephone companies — first-order queries are free, just like local calls, because they are local by Gaifman's theorem. These are the rules. Well, what is your strategy to compute all you want to know about the database, paying as little as possible? And how much will the total price be? We answer this question, showing that whether you can get your answer without any costs at all depends on whether or not the theories of databases in the provided query language are finitely axiomatizable. Thus, assuming there is a limit on the number of variables allowed in queries, if the database query language is fixpoint logic, you can get everything for free. When it is Datalog, however, even with inequality and negation of the edb's, you have to pay. We present a method which, for graphs of n vertices, costs about $log₂ log₂ n. Thus querying a graph with 1 Terabyte of vertices costs $7.00. We demonstrate that this price cannot be substantially reduced without causing a large computational overhead.

Jerzy Tyszkiewicz
Local properties of query languages

Expressiveness of database query languages remains the major motivation for research in finite model theory. However, most techniques in finite model theory are based on Ehrenfeucht-Fraïssé games, whose application often involves a rather intricate argument. Furthermore, most tools apply to first-order logic and some of its extensions, but not to languages that resemble real query languages, like SQL. In this paper we use locality to analyze the expressiveness of query languages. A query is local if, to determine whether a tuple belongs to the output, one only has to look at a certain predetermined portion of the input. We study local properties of queries in a context that goes beyond the pure first-order case, and then apply the resulting tools to analyze the expressive power of SQL-like languages. We first prove a general result describing the outputs of local queries, which leads to many easy inexpressibility proofs. We then consider a closely related bounded degree property, which describes the outputs of queries on structures that locally look “simple,” and makes inexpressibility proofs particularly easy. We prove that every local query has this property. Since every relational calculus (first-order) query is local, these results can be viewed as “off-the-shelf” strategies for inexpressibility proofs, which are often easier to apply than the games. We also show that some generalizations of the bounded degree property that were conjectured to hold fail for relational calculus. We then prove that the language obtained from relational calculus by adding grouping and aggregates (essentially plain SQL) has the bounded degree property, thus solving an open problem. Consequently, first-order queries with Härtig and Rescher quantifiers have the bounded degree property. Finally, we apply our results to show that SQL and relational calculus are incapable of maintaining the transitive closure view even in the presence of certain kinds of auxiliary data.

Guozhu Dong, Leonid Libkin, Limsoon Wong
Expressiveness and complexity of active databases

The expressiveness and complexity of several active database prototypes are formally studied. First, a generic framework for the specification of active databases is developed. This factors out the common aspects of the prototypes considered, and allows studying various active database features independently of any specific prototype. Furthermore, each of the prototypes can be specified by specializing certain parameters of the framework. The prototypes considered are ARDL, HiPAC, Postgres, Starburst, and Sybase. Using their formal specifications, the prototypes are compared to each other with respect to expressive power. The results provide insight into the programming paradigm of active databases, the interplay of various features, and their impact on expressiveness and complexity.

Philippe Picouet, Victor Vianu
A model theoretic approach to update rule programs

Semantics of active rules is generally defined by execution models. The lack of a clean declarative semantics threatens the reliability of active systems. In this paper, a declarative semantics for update rules, based on the well-founded semantics of Datalog programs, is investigated. The validation of our proposal proceeds by demonstrating it for static and transition integrity constraint enforcement.

N. Bidoit, S. Maabout
Abstract interpretation of active rules and its use in termination analysis

The behaviour of rules in an active database system can be difficult to predict, and much work has been devoted to the development of automatic support for reasoning about properties such as confluence and termination. We show how abstract interpretation can provide a generic framework for analysis of active rules. Abstract interpretation is a well-understood, semantics-based method for static analysis. Its advantage, apart from generality, lies in the separation of concerns: Once the underlying semantics has been captured formally, a variety of analyses can be derived, almost for free, as approximations to the semantics. Powerful general theorems enable simple proofs of global correctness and uniform termination of specific analyses. We outline these ideas and show, as an example application, a new method for termination analysis. In terms of precision, the method compares favourably with previous solutions to the problem. This is because the method investigates the flow of data rather than just the syntax of conditions and actions.

James Bailey, Lobel Crnogorac, Kotagiri Ramamohanarao, Harald Søndergaard
Structural issues in active rule systems

Active database systems enhance the functionality of traditional databases through the use of active rules or ‘triggers’. There is little consensus, though, on what components should be included in a rule system. In this paper, the expressive power of some simple active database rule systems is examined and the effect of choosing different features studied. Four important parameters of variation are presented, namely the rule language, the external query language, the meta rule language and the pending rule structure. We show that each of these is highly influential in determining the expressiveness of the rule system as a whole, and that an appreciation of them can serve as a basis for understanding the broader picture of system behaviour.

James Bailey, Guozhu Dong, Kotagiri Ramamohanarao
Discovering all most specific sentences by randomized algorithms
Extended abstract

Data mining can in many instances be viewed as the task of computing a representation of a theory of a model or a database. In this paper we present a randomized algorithm that can be used to compute the representation of a theory in terms of the most specific sentences of that theory. In addition to randomization, the algorithm uses a generalization of the concept of hypergraph transversal. We apply the general algorithm to discovering maximal frequent sets in 0/1 data and to computing minimal keys in relations. We present some empirical results on the performance of these methods on real data. We also give some complexity-theoretic evidence of the hardness of these problems.

Dimitrios Gunopulos, Heikki Mannila, Sanjeev Saluja
A formal foundation for distributed workflow execution based on state charts

This paper provides a formal foundation for distributed workflow executions. The state chart formalism is adapted to the needs of a workflow model in order to establish a basis for both correctness reasoning and run-time support for complex and large-scale workflow applications. To allow for the distributed execution of a workflow across different workflow servers, which is required for scalability and organizational decentralization, a method for the partitioning of workflow specifications is developed. It is proven that the partitioning preserves the original state chart's behavior.

Dirk Wodtke, Gerhard Weikum
Incorporating user preferences in multimedia queries
Extended abstract

In a multimedia database system, queries may be fuzzy: thus, the answer to a query such as (Color=‘red’) may not be 0 (false) or 1 (true), but instead a number between 0 and 1. A conjunction, such as (Color=‘red’) ∧ (Sound=‘loud’), is evaluated by first evaluating the individual conjuncts and then combining the answers by some scoring function. Typical scoring functions include the min (the standard scoring function for the conjunction in fuzzy logic) and the average. We address the question of how to permit the user to weight the importance of atomic subformulas. In particular, we give a semantics for permitting non-uniform weights, by giving an explicit formula (that is based on the underlying scoring function). This semantics permits an efficient implementation with a low database access cost in a multimedia database system in important cases of interest.

Ronald Fagin, Edward L. Wimmers
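
One way to see how an explicit weighting formula can sit on top of an arbitrary scoring function: sort the conjuncts by weight and blend the combined scores of the top-i conjuncts by successive weight differences. The sketch below follows that shape as an illustration; treat it as our reading of such a scheme, not the paper's exact formula:

```python
def weighted_score(scores, weights, combine=min):
    """Weighted combination of fuzzy atomic scores. With weights sorted so
    that w1 >= w2 >= ... >= wm and summing to 1, return
        sum over i of i * (w_i - w_{i+1}) * combine(top-i scores),
    with w_{m+1} = 0. This reduces to combine(...) when all weights are
    equal, and to the single top-weighted score when its weight is 1."""
    pairs = sorted(zip(weights, scores), reverse=True)
    ws = [w for w, _ in pairs] + [0.0]
    ss = [s for _, s in pairs]
    return sum((i + 1) * (ws[i] - ws[i + 1]) * combine(ss[:i + 1])
               for i in range(len(ss)))
```

With `combine=min` (the standard fuzzy conjunction), equal weights give plain min, and giving one conjunct all the weight ignores the others entirely — the two boundary behaviors a non-uniform weighting semantics should interpolate between.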
Queries and computation on the Web

The paper introduces a model of the Web as an infinite, semi-structured set of objects. We reconsider the classical notions of genericity and computability of queries in this new context and relate them to styles of computation prevalent on the Web, based on browsing and searching. We revisit several well-known declarative query languages (first-order logic, Datalog, and Datalog with negation) and consider their computational characteristics in terms of the notions introduced in this paper. In particular, we are interested in languages or fragments thereof which can be implemented by browsing, or by browsing and searching combined. Surprisingly, stratified and well-founded semantics for negation turn out to have basic shortcomings in this context, while inflationary semantics emerges as an appealing alternative.

Serge Abiteboul, Victor Vianu
The complexity of iterated belief revision

In this paper we analyze the complexity of revising a knowledge base when an iteration of this process is necessary. The analysis concerns both the classical problems of belief revision (inference, model checking, computation of the new base) and new issues, related to the problem of “committing” the changes.

Paolo Liberatore
Expressive power of unary counters

We compare the expressive power on finite models of two extensions of first-order logic L with equality. L(Ct) is formed by adding an operator count{x: φ}, which builds a term of sort ℕ that counts the number of elements of the finite model satisfying a formula φ. Our main result shows that the stronger operator count{t(x): φ}, where t(x) is a term of sort ℕ, cannot be expressed in L(Ct). That is, being able to count elements does not allow one to count terms. This paper also continues our interest in new proof techniques in database theory. The proof of the unary counter result combines a number of model-theoretic techniques that give powerful tools for expressivity bounds: in particular, we discuss here the use of indiscernibles, the Paris-Harrington form of Ramsey's theorem, and nonstandard models of arithmetic.

Michael Benedikt, H. Jerome Keisler
Concurrency control theory for deferred materialized views

We consider concurrency control problems that arise in the presence of materialized views. Consider a database system supporting materialized views to speed up queries. For a range of important applications (e.g. banking, billing, network management), transactions that access materialized views would like to get some consistency guarantees — if a transaction reads a base relation after an update, and then reads a materialized view derived from the base relation, it expects to see the effect of the base update on the materialized view. If a transaction reads two views, it expects that the two views reflect a single consistent database state. Such guarantees are not easy to obtain, as materialized views become inconsistent upon updates to base relations. Immediate maintenance reestablishes consistency within the transaction that updates the base relation, but this consistency comes at the cost of delaying update transactions. Deferred maintenance has been proposed to avoid penalizing update transactions by shifting maintenance into a different transaction (for example, into the transaction that reads the view). However, doing so causes a materialized view to become temporarily inconsistent with its definition. Consequently, transactions that read multiple materialized views, or that read a materialized view and also read and/or write base relations, may execute in a non-serializable manner even when they are running under a strict two-phase locking (2PL) protocol. We formalize the concurrency control problem in systems supporting materialized views. We develop a serializability theory based upon conflicts and serialization graphs in the presence of materialized views. Concurrency control algorithms based on this theory are being developed in the SWORD/Ode database system.

Akira Kawaguchi, Daniel Lieuwen, Inderpal Singh Mumick, Dallan Quass, Kenneth A. Ross
Serializability of nested transactions in multidatabases

The correctness of nested transactions in multidatabases differs from that of flat transactions in that, for nested transactions, the execution order of siblings at each related site must also be consistent. In this paper we first propose a simple but powerful theory for the serializability of nested transactions in multidatabases, and then a technique called the Nested Tickets Method for Nested Transactions (NTNT). The NTNT technique provides correctness of nested transactions in multidatabases without violating the local autonomy of the participating DBMSs. The algorithm is fully distributed; in other words, there is no central scheduler. The correctness of the NTNT technique is proved using the developed theory.

Ugur Halici, Budak Arpinar, Asuman Dogac
Adding structure to unstructured data

We develop a new notion of schema for unstructured data. Traditional schemas resemble the type systems of programming languages. For unstructured data, however, the underlying type may be much less constrained, and hence an alternative way of expressing constraints on the data is needed. Here, we propose that both data and schema be represented as edge-labeled graphs. We develop notions of conformance between a graph database and a graph schema and show that there is a natural and efficiently computable ordering on graph schemas. We then examine certain subclasses of schemas and show that schemas are closed under query application. Finally, we discuss how they may be used in query decomposition and optimization.

Peter Buneman, Susan Davidson, Mary Fernandez, Dan Suciu
Correspondence and translation for heterogeneous data

We presented a specification of the integration of heterogeneous data based on correspondence rules. We showed how a single specification can serve many purposes (including two-way translation) under some reasonable restrictions. We claim that the framework and restrictions are acceptable in practice; in particular, one can show that all the document-OODB correspondences/translations of [2, 3] are covered. We are currently working on substantiating this further through more experimentation. When applying the work presented here, a number of issues arise, such as the specification of default values when some information is missing in the translation. A more complex one is the introduction of some simple constraints in the model, e.g., keys. Another important implementation issue is choosing between keeping one of the representations virtual vs. materializing both. In particular, it is conceivable to apply in this larger setting the optimization techniques developed in an OODB/SGML context for queries [2] and updates [3].

Serge Abiteboul, Sophie Cluet, Tova Milo
Type-consistency problems for queries in object-oriented databases

The method invocation mechanism is one of the essential features of object-oriented programming languages. This mechanism contributes to data encapsulation and code reuse, but carries a risk of run-time type errors. In the case of object-oriented databases (OODBs), a run-time error causes a rollback. Therefore, it is desirable to ensure that a given OODB schema is consistent, i.e., that no run-time type error occurs during the execution of queries on any database instance of the OODB schema. This paper discusses the computational complexity of type-consistency problems. As a model of OODB schemas, we adopt the update schemas introduced by Hull et al., which have all of the basic features of OODBs, such as class hierarchy, inheritance, and complex objects. For several subclasses of update schemas, the complexity of the type-consistency problem is presented. Importantly, it turns out that non-flatness of the class hierarchy, nontermination of queries, and update operations in queries each make the problem difficult.

Yasunori Ishihara, Hiroyuki Seki, Minoru Ito
Object-oriented database evolution

An evolution language is composed of an instance update language, a schema update language, and a mechanism to combine them. We present a formal evolution language for object-oriented database management systems. This language allows one to write programs that update the schema and the instance simultaneously. Static checking of these programs ensures that the resulting database is consistent. We propose an autonomous instance update language, based on an adequate specific query language and a pure instance update language. The main features of the query language are a formal type inference system including disjunctive types, and the decidability of the satisfiability problem, despite a negation operator. The pure instance update language allows object migration, and the creation and deletion of objects and references; its semantics is declarative, and an algorithm to compute it is presented. We propose an evolution mechanism for combining this instance update language with a classical schema update language, and use it to obtain an evolution language. Decidability of consistency is shown for a fragment of this language, by reduction to first-order logic with two variables.

Jean-Bernard Lagorce, Arūnas Stočkus, Emmanuel Waller
Performance of nearest neighbor queries in R-trees

Nearest neighbor (NN) queries are posed very frequently in spatial applications. Recently, a branch-and-bound algorithm based on R-trees has been developed to answer NN queries efficiently. In this paper, we combine techniques originally used for the analysis of range and spatial join queries in order to derive measures of the performance of NN queries. We estimate the number of disk accesses incurred in processing an NN query. Lower and upper bounds are defined that estimate the performance of NN queries very closely. The theoretical analysis is verified with experimental results, under uniform and non-uniform distributions of queries and data, in the 2-dimensional address space.

Apostolos Papadopoulos, Yannis Manolopoulos
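
The branch-and-bound NN algorithm analyzed here visits R-tree entries in order of their minimum distance to the query point and prunes entries that cannot beat the best point found so far. A toy best-first version over a hand-built two-level tree (the node layout and squared-distance metric are our simplifying assumptions; counting disk accesses, the paper's actual subject, is omitted):

```python
import heapq

def mindist(box, q):
    """Minimum squared distance from point q to rectangle box = ((x1, y1), (x2, y2))."""
    (x1, y1), (x2, y2) = box
    dx = max(x1 - q[0], 0, q[0] - x2)
    dy = max(y1 - q[1], 0, q[1] - y2)
    return dx * dx + dy * dy

def nearest(node, q):
    """Best-first NN search over a toy R-tree: internal nodes are
    ('node', box, children), leaves are ('leaf', box, points).
    Entries are popped in increasing mindist; id() breaks ties."""
    heap = [(mindist(node[1], q), id(node), node)]
    best, best_d = None, float('inf')
    while heap:
        d, _, n = heapq.heappop(heap)
        if d >= best_d:
            break  # no remaining entry can beat the best point found
        kind, box, contents = n
        if kind == 'leaf':
            for p in contents:
                pd = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
                if pd < best_d:
                    best, best_d = p, pd
        else:
            for child in contents:
                heapq.heappush(heap, (mindist(child[1], q), id(child), child))
    return best
```

The pruning condition is the key cost driver: the number of nodes popped before the break is exactly the number of disk accesses the paper's bounds estimate.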
Optimal allocation of two-dimensional data (Extended abstract)

Efficient browsing and retrieval of geographically referenced information requires the allocation of data on different storage devices for concurrent retrieval. By dividing a two-dimensional space into tiles, a system can allow users to specify regions of interest using a query rectangle and then retrieve all information related to tiles overlapping with the query. In this paper, we derive necessary and sufficient conditions for strictly optimal allocations of two-dimensional data. Such allocations, when they exist, guarantee that for any query the minimum number of tiles are assigned to the same storage device, and hence ensure maximal retrieval concurrency.

Khaled A. S. Abdel-Ghaffar, Amr El Abbadi
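
One classic family of tile allocations is skewed modular declustering: tile (i, j) goes to device (i + k·j) mod M. This is an illustrative scheme in the spirit of the problem, not necessarily the paper's characterization; the function names and the particular constants in the example are our assumptions:

```python
def assign(i, j, m, k):
    """Skewed modular declustering: tile (i, j) goes to device (i + k*j) mod m.
    With k coprime to m, any full row or column of m tiles hits all m devices."""
    return (i + k * j) % m

def devices_hit(x, y, w, h, m, k):
    """Set of devices accessed by a w x h query rectangle anchored at tile (x, y)."""
    return {assign(i, j, m, k) for i in range(x, x + w) for j in range(y, y + h)}
```

For example, with m = 5 devices and skew k = 2, every 1×5 or 5×1 query rectangle touches all five devices, and a 2×2 rectangle touches four distinct devices — the kind of spread a strictly optimal allocation must guarantee for every query.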
Efficient indexing for constraint and temporal databases

We examine new I/O-efficient techniques for indexing problems in constraint and temporal data models. We present algorithms for these problems that are considerably simpler than previous solutions. Our solutions are unique in the sense that they only use B+-trees rather than special-purpose data structures. Indexing for many general constraint data models can be reduced to interval intersection. We present a new algorithm for this problem using a query-time/space tradeoff, which achieves the optimal query time O(log_B n + t/B) I/Os in linear space O(n/B) using B+-trees. (Here, n is the number of intervals, t the number of intervals in the output of a query, and B the disk block size.) It is easy to update this data structure, but small worst-case bounds do not seem possible. Previous approaches have achieved these bounds but are fairly complex and rely mostly on reducing the interval intersection problem to special cases of two-dimensional search. Some of them can also handle updates in O(log_B n) I/Os amortized. Indexing in many temporal models becomes a generalization of interval management, where each temporal object is characterized by an interval and a key. There are many different ways of querying these objects, and we achieve optimal bounds for many of these queries. These bounds are achieved using a modification of the technique used for the constraint indexing problem. Our technique is much simpler than other techniques that have been used for achieving similar bounds.

Sridhar Ramaswamy
On topological elementary equivalence of spatial databases

We consider spatial databases and queries definable using first-order logic and real polynomial inequalities. We are interested in topological queries: queries whose result only depends on the topological aspects of the spatial data. Two spatial databases are called topologically elementary equivalent if they cannot be distinguished by such topological first-order queries. Our contribution is a natural and effective characterization of topological elementary equivalence of closed databases in the real plane. As far as topological elementary equivalence is concerned, it does not matter whether we use first-order logic with full polynomial inequalities, or first-order logic with simple order comparisons only.

Bart Kuijpers, Jan Paredaens, Jan Van den Bussche
Model-theoretic minimal change operators for constraint databases

Database systems should allow users to insert new information (described by a first-order sentence) into a database without specifying exactly how. The database systems should be able to figure out what tuples to add or delete from the current database to satisfy fully the user's request. The guiding principle of accomplishing such insertions is the concept of model-theoretic minimal change. This paper shows that this concept can be applied to constraint databases. In particular, any constraint database change operator that satisfies the axioms for revision [AGM85], update [KM92], or arbitration [Rev96] accomplishes a model-theoretic minimal change in a well-defined sense. The paper also presents concrete operators for revision, update, and arbitration for constraint databases with real polynomial inequality constraints.

Peter Z. Revesz
Tractable iteration mechanisms for bag languages
Preliminary report

The goal of this paper is to study tractable iteration mechanisms for bags. The presence of duplicates in bags prevents iteration mechanisms developed in the context of sets from being applied directly to bags without losing tractability. We study two constructs for controlling tractability of iteration over bags. The deflationary fixpoint construct keeps removing elements from a bag until a fixpoint is reached. The bounded fixpoint construct is an inflationary iteration mechanism that never exceeds some predefined bounding bag. We study these constructs in the context of a standard (nested) bag algebra. We show that the deflationary and bounded inflationary fixpoint constructs are equally expressive and strictly more expressive than their set-based counterparts. We also show that, unlike in the set case, the bag algebra with bounded fixpoint fails to capture all PTIME queries over databases with ordered domains. We then show that adding just one construct, which can be used to assign unique tags to duplicates, captures the class of all polynomial time queries over bags when a total ordering on the domain of atomic elements is available. Finally, we compare the expressive powers of the bag algebra and the nested relational algebra with aggregate functions in the presence of these fixpoint operators.

Latha S. Colby, Leonid Libkin
Backmatter
Metadata
Titel
Database Theory — ICDT '97
Edited by
Foto Afrati
Phokion Kolaitis
Copyright year
1997
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-49682-3
Print ISBN
978-3-540-62222-2
DOI
https://doi.org/10.1007/3-540-62222-5