
2004 | Book

Advances in Database Technology - EDBT 2004

9th International Conference on Extending Database Technology, Heraklion, Crete, Greece, March 14-18, 2004

Edited by: Elisa Bertino, Stavros Christodoulakis, Dimitris Plexousakis, Vassilis Christophides, Manolis Koubarakis, Klemens Böhm, Elena Ferrari

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

The 9th International Conference on Extending Database Technology, EDBT 2004, was held in Heraklion, Crete, Greece, during March 14–18, 2004. The EDBT series of conferences is an established and prestigious forum for the exchange of the latest research results in data management. Held every two years in an attractive European location, the conference provides unique opportunities for database researchers, practitioners, developers, and users to explore new ideas, techniques, and tools, and to exchange experiences. The previous events were held in Venice, Vienna, Cambridge, Avignon, Valencia, Konstanz, and Prague. EDBT 2004 had the theme “new challenges for database technology,” with the goal of encouraging researchers to take a greater interest in the current exciting technological and application advancements and to devise and address new research and development directions for database technology. From its early days, database technology has been challenged and advanced by new uses and applications, and it continues to evolve along with application requirements and hardware advances. Today’s DBMS technology faces several new challenges. Technological trends and new computation paradigms, and applications such as pervasive and ubiquitous computing, grid computing, bioinformatics, trust management, virtual communities, and digital asset management, to name just a few, require database technology to be deployed in a variety of environments and for a number of different purposes. Such an extensive deployment will also require trustworthy, resilient database systems, as well as easy-to-manage and flexible ones, to which we can entrust our data in whatever form they take.

Table of contents

Frontmatter

Invited Papers

Converged Services: A Hidden Challenge for the Web Services Paradigm

The web has brought a revolution in sharing information and in human-computer interaction. The web services paradigm (based initially on standards such as SOAP, WSDL, UDDI, BPEL) will bring the next revolution, enabling flexible, intricate, and largely automated interactions between web-resident services and applications. But the telecommunications world is also changing, from isolated, monolithic legacy stove-pipes to a much more modular, internet-style framework that will enable rich flexibility in creating communication and collaboration services. This will be enabled by the existing Parlay/OSA standard and emerging standards for all-IP networks (e.g., 3GPP IMS). We are evolving towards a world of “converged” services, not two parallel worlds of web services vs. telecom services.

Richard Hull
GRIDS, Databases, and Information Systems Engineering Research

GRID technology, emerging in the late nineties, has evolved from a metacomputing architecture towards a pervasive computation and information utility. However, the architectural developments strongly echo the computational origins, and information systems engineering aspects have received scant attention. The development within the GRID community of the W3C-inspired OGSA indicates a willingness to move in a direction more suited to the wider end user requirements. In particular, the OGSA/DAI initiative provides a web-services-level interface to databases. In contrast to this stream of development, early architectural ideas for a more general GRIDs environment, articulated in the UK in 1999, have recently been more widely accepted, modified, evolved, and enhanced by a group of experts working under the auspices of the new EC DGINFSO F2 (GRIDs) Unit. The resulting report on ‘Next Generation GRIDs’ was published in June 2003 and is released by the EC as an adjunct to the FP6 Call for Proposals documentation. The report proposes the need for a wealth of research in all aspects of information systems engineering, within which the topics of advanced distributed parallel multimedia heterogeneous database systems with greater representativity and expressivity have some prominence. Topics such as metadata, security, trust, persistence, performance, and scalability are all included. This represents a huge opportunity for the database community, particularly in Europe.

Keith G. Jeffery
Security and Privacy for Web Databases and Services

A semantic web can be thought of as a web that is highly intelligent and sophisticated, needing little or no human intervention to carry out tasks such as scheduling appointments, coordinating activities, and searching for complex documents, as well as integrating disparate databases and information systems. While much progress has been made toward developing such an intelligent web, there is still a lot to be done. For example, there is little work on security and privacy for the semantic web. However, before we examine security for the semantic web we need to ensure that its key components, such as web databases and services, are secure. This paper will mainly focus on security and privacy issues for web databases and services. Finally, some directions toward developing a secure semantic web will be provided.

Elena Ferrari, Bhavani Thuraisingham

Distributed, Mobile, and Peer-to-Peer Database Systems

Content-Based Routing of Path Queries in Peer-to-Peer Systems

Peer-to-peer (P2P) systems are gaining increasing popularity as a scalable means to share data among a large number of autonomous nodes. In this paper, we consider the case in which the nodes in a P2P system store XML documents. We propose a fully decentralized approach to the problem of routing path queries among the nodes of a P2P system, based on maintaining specialized data structures, called filters, that efficiently summarize the content (i.e., the documents) of one or more nodes. Our proposed filters, called multi-level Bloom filters, are based on extending Bloom filters so that they maintain information about the structure of the documents. In addition, we advocate building a hierarchical organization of nodes by clustering together nodes with similar content. Similarity between nodes is related to the similarity between the corresponding filters. We also present an efficient method for update propagation. Our experimental results show that multi-level Bloom filters outperform classical Bloom filters in routing path queries. Furthermore, the content-based hierarchical grouping of nodes increases recall, that is, the number of documents that are retrieved.

Georgia Koloniari, Evaggelia Pitoura
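
A minimal sketch of the filter idea, assuming one Bloom filter per document depth level (the class name, sizes, and hash scheme below are illustrative, not the authors' implementation):

    import hashlib

    class MultiLevelBloom:
        def __init__(self, levels=4, bits=1024, hashes=3):
            self.levels, self.bits, self.hashes = levels, bits, hashes
            self.filters = [[0] * bits for _ in range(levels)]

        def _positions(self, tag):
            for i in range(self.hashes):
                h = hashlib.sha256(f"{i}:{tag}".encode()).hexdigest()
                yield int(h, 16) % self.bits

        def add_path(self, path):
            # path = tag names from the root, e.g. ["dblp", "article", "author"]
            for depth, tag in enumerate(path[:self.levels]):
                for p in self._positions(tag):
                    self.filters[depth][p] = 1

        def may_contain(self, path):
            # False is certain; True may be a false positive, as in any Bloom filter
            return all(self.filters[d][p]
                       for d, tag in enumerate(path[:self.levels])
                       for p in self._positions(tag))

    f = MultiLevelBloom()
    f.add_path(["dblp", "article", "author"])
    print(f.may_contain(["dblp", "article", "author"]))  # True
    print(f.may_contain(["dblp", "book", "author"]))     # almost surely False

A flat Bloom filter over all tags could only answer "does this tag occur anywhere?"; keeping one filter per level is what lets a router discard paths whose tags occur only at the wrong depths.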
Energy-Conserving Air Indexes for Nearest Neighbor Search

A location-based service (LBS) provides information based on the location information specified in a query. Nearest-neighbor (NN) search is an important class of queries supported in LBSs. This paper studies energy-conserving air indexes for NN search in a wireless broadcast environment. The linear access requirement of wireless broadcast weakens the performance of existing search algorithms designed for traditional spatial databases. In this paper, we propose a new energy-conserving index, called the grid-partition index, which enables a single linear scan of the index for any NN query. The idea is to partition the search space for NN queries into grid cells and index all the objects that are potential nearest neighbors of a query point in each grid cell. Three grid partition schemes are proposed for the grid-partition index. The performance of the proposed grid-partition indexes and two representative traditional indexes (enhanced for wireless broadcast) is evaluated using both synthetic and real data. The results show that the grid-partition index substantially outperforms the traditional indexes.

Baihua Zheng, Jianliang Xu, Wang-Chien Lee, Dik Lun Lee
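
A hedged sketch of the precomputation described above, assuming 2D points and a rectangular cell; the generic min/max-distance bound below stands in for the paper's three partition schemes:

    import math

    def min_dist(obj, cell):   # cell = (x0, y0, x1, y1)
        x0, y0, x1, y1 = cell
        dx = max(x0 - obj[0], 0, obj[0] - x1)
        dy = max(y0 - obj[1], 0, obj[1] - y1)
        return math.hypot(dx, dy)

    def max_dist(obj, cell):
        x0, y0, x1, y1 = cell
        dx = max(abs(obj[0] - x0), abs(obj[0] - x1))
        dy = max(abs(obj[1] - y0), abs(obj[1] - y1))
        return math.hypot(dx, dy)

    def candidates(objects, cell):
        # o can only be the NN of some point in the cell if its minimum
        # distance does not exceed the best guaranteed (max) distance that
        # some object achieves over the whole cell
        bound = min(max_dist(o, cell) for o in objects)
        return [o for o in objects if min_dist(o, cell) <= bound]

    objs = [(1, 1), (9, 9), (5, 4)]
    print(candidates(objs, (0, 0, 2, 2)))  # only (1, 1) survives the bound

Storing such a candidate list per cell is what turns an NN query into a single linear scan: the client reads only the (small) cell it falls in.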
MobiEyes: Distributed Processing of Continuously Moving Queries on Moving Objects in a Mobile System

Location monitoring is an important issue for real-time management of mobile object positions. Significant research efforts have been dedicated to techniques for efficient processing of spatial continuous queries on moving objects in a centralized location monitoring system. Surprisingly, very few have promoted a distributed approach to real-time location monitoring. In this paper we present a distributed and scalable solution to processing continuously moving queries on moving objects and describe the design of MobiEyes, a distributed real-time location monitoring system in a mobile environment. MobiEyes utilizes the computational power at mobile objects, leading to significant savings in terms of server load and messaging cost when compared to solutions relying on central processing of location information at the server. We introduce a set of optimization techniques, such as Lazy Query Propagation, Query Grouping, and Safe Periods, to restrict the amount of computation handled by the moving objects and to enhance the performance and system utilization of MobiEyes. We also provide a simulation model in a mobile setup to study the scalability of the MobiEyes distributed location monitoring approach with regard to server load, messaging cost, and the amount of computation required on the mobile objects.

Buğra Gedik, Ling Liu

Data Mining and Knowledge Discovery

DBDC: Density Based Distributed Clustering

Clustering has become an increasingly important task in modern application domains such as marketing and purchasing assistance, multimedia, and molecular biology, as well as many others. In most of these areas, the data are originally collected at different sites. In order to extract information from these data, they are merged at a central site and then clustered. In this paper, we propose a different approach. We cluster the data locally and extract suitable representatives from these clusters. These representatives are sent to a global server site where we restore the complete clustering based on the local representatives. This approach is very efficient, because the local clusterings can be carried out quickly and independently of each other. Furthermore, we have low transmission cost, as the number of transmitted representatives is much smaller than the cardinality of the complete data set. Based on this small number of representatives, the global clustering can be done very efficiently. For both the local and the global clustering, we use a density-based clustering algorithm. The combination of both the local and the global clustering forms our new DBDC (Density Based Distributed Clustering) algorithm. Furthermore, we discuss the complex problem of finding a suitable quality measure for evaluating distributed clusterings. We introduce two quality criteria which are compared to each other and which allow us to evaluate the quality of our DBDC algorithm. In our experimental evaluation, we show that we do not have to sacrifice clustering quality in order to gain an efficiency advantage when using our distributed clustering approach.

Eshref Januzaj, Hans-Peter Kriegel, Martin Pfeifle
Iterative Incremental Clustering of Time Series

We present a novel anytime version of partitional clustering algorithms, such as k-Means and EM, for time series. The algorithm works by leveraging the multi-resolution property of wavelets. The dilemma of choosing the initial centers is mitigated by initializing the centers at each approximation level using the final centers returned by the coarser representations. In addition to casting the clustering algorithms as anytime algorithms, this approach has two other very desirable properties. By working at lower dimensionalities we can efficiently avoid local minima; therefore, the quality of the clustering is usually better than that of the batch algorithm. In addition, even if the algorithm is run to completion, our approach is much faster than its batch counterpart. We explain, and empirically demonstrate, these surprising and desirable properties with comprehensive experiments on several publicly available real data sets. We further demonstrate that our approach can be generalized to a framework covering a much broader range of algorithms and data mining problems.

Jessica Lin, Michail Vlachos, Eamonn Keogh, Dimitrios Gunopulos
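
A small sketch of the anytime scheme, assuming pairwise Haar averaging as the multi-resolution step and scikit-learn's k-Means as the partitional algorithm (both are stand-ins for the paper's setup):

    import numpy as np
    from sklearn.cluster import KMeans

    def coarsen(X, level):
        # Haar-style approximation: average adjacent pairs 'level' times
        for _ in range(level):
            X = (X[:, 0::2] + X[:, 1::2]) / 2.0
        return X

    def iterative_kmeans(X, k=3, levels=3):
        centers = None
        for level in range(levels, -1, -1):            # coarse -> fine
            Xl = coarsen(X, level)
            if centers is None:
                km = KMeans(n_clusters=k, n_init=5).fit(Xl)
            else:
                # seed this level with the previous (coarser) final centers,
                # upsampled by duplicating each coefficient
                seeds = np.repeat(centers, 2, axis=1)
                km = KMeans(n_clusters=k, init=seeds, n_init=1).fit(Xl)
            centers = km.cluster_centers_
        return km.labels_

    X = np.random.rand(100, 16)    # 100 series of length 16 (a power of two)
    print(iterative_kmeans(X)[:10])

The loop can be interrupted after any level and still return a usable clustering, which is what makes the algorithm "anytime".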
LIMBO: Scalable Clustering of Categorical Data

Clustering is a problem of great practical importance in numerous applications. The problem of clustering becomes more challenging when the data is categorical, that is, when there is no inherent distance measure between data values. We introduce LIMBO, a scalable hierarchical categorical clustering algorithm that builds on the Information Bottleneck (IB) framework for quantifying the relevant information preserved when clustering. As a hierarchical algorithm, LIMBO has the advantage that it can produce clusterings of different sizes in a single execution. We use the IB framework to define a distance measure for categorical tuples and we also present a novel distance measure for categorical attribute values. We show how the LIMBO algorithm can be used to cluster both tuples and values. LIMBO handles large data sets by producing a memory bounded summary model for the data. We present an experimental evaluation of LIMBO, and we study how clustering quality compares to other categorical clustering algorithms. LIMBO supports a trade-off between efficiency (in terms of space and time) and quality. We quantify this trade-off and demonstrate that LIMBO allows for substantial improvements in efficiency with negligible decrease in quality.

Periklis Andritsos, Panayiotis Tsaparas, Renée J. Miller, Kenneth C. Sevcik

Trustworthy Database Systems

A Framework for Efficient Storage Security in RDBMS

With the widespread use of e-business coupled with the public’s awareness of data privacy issues and recent database security related legislations, incorporating security features into modern database products has become an increasingly important topic. Several database vendors already offer integrated solutions that provide data privacy within existing products. However, treating security and privacy issues as an afterthought often results in inefficient implementations. Some notable RDBMS storage models (such as the N-ary Storage Model) suffer from this problem. In this work, we analyze issues in storage security and discuss a number of trade-offs between security and efficiency. We then propose a new secure storage model and a key management architecture which enable efficient cryptographic operations while maintaining a very high level of security. We also assess the performance of our proposed model by experimenting with a prototype implementation based on the well-known TPC-H data set.

Bala Iyer, Sharad Mehrotra, Einar Mykletun, Gene Tsudik, Yonghua Wu
Beyond 1-Safety and 2-Safety for Replicated Databases: Group-Safety

In this paper, we study the safety guarantees of group communication-based database replication techniques. We show that there is a model mismatch between group communication and database, and because of this, classical group communication systems cannot be used to build 2-safe database replication. We propose a new group communication primitive called end-to-end atomic broadcast that solves the problem, i.e., can be used to implement 2-safe database replication. We also introduce a new safety criterion, called group-safety, that has advantages both over 1-safety and 2-safety. Experimental results show the gain of efficiency of group-safety over lazy replication, which ensures only 1-safety.

Matthias Wiesmann, André Schiper
A Condensation Approach to Privacy Preserving Data Mining

In recent years, privacy preserving data mining has become an important problem because of the large amount of personal data which is tracked by many business applications. In many cases, users are unwilling to provide personal information unless the privacy of sensitive information is guaranteed. In this paper, we propose a new framework for privacy preserving data mining of multi-dimensional data. Previous work for privacy preserving data mining uses a perturbation approach which reconstructs data distributions in order to perform the mining. Such an approach treats each dimension independently and therefore ignores the correlations between the different dimensions. In addition, it requires the development of a new distribution based algorithm for each data mining problem, since it does not use the multi-dimensional records, but uses aggregate distributions of the data as input. This leads to a fundamental re-design of data mining algorithms. In this paper, we will develop a new and flexible approach for privacy preserving data mining which does not require new problem-specific algorithms, since it maps the original data set into a new anonymized data set. This anonymized data closely matches the characteristics of the original data including the correlations among the different dimensions. We present empirical results illustrating the effectiveness of the method.

Charu C. Aggarwal, Philip S. Yu
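
A rough sketch of the condensation idea under simplifying assumptions (groups formed by a crude sort; the paper's grouping is more careful): keep only per-group statistics and sample synthetic records from them.

    import numpy as np

    def condense(X, k=10, rng=np.random.default_rng(0)):
        order = np.argsort(X[:, 0])        # crude locality-preserving ordering
        synthetic = []
        for start in range(0, len(X), k):
            group = X[order[start:start + k]]
            mu, cov = group.mean(axis=0), np.cov(group.T)
            # sampling from the joint statistics keeps cross-dimension
            # correlations, unlike per-dimension perturbation
            synthetic.append(rng.multivariate_normal(mu, cov, size=len(group)))
        return np.vstack(synthetic)

    X = np.random.default_rng(1).normal(size=(200, 3))
    X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]     # correlated dimensions
    print(np.corrcoef(condense(X).T)[0, 1])     # correlation survives

Because the output has the same multi-dimensional record format as the input, any existing mining algorithm can run on it unchanged, which is the point the abstract makes against distribution-based perturbation.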

Innovative Query Processing Techniques for XML Data

Efficient Query Evaluation over Compressed XML Data

XML suffers from the major limitation of high redundancy. Even though compression can be beneficial for XML data, once compressed, the data can seldom be browsed and queried in an efficient way. To address this problem, we propose XQueC, an [XQue]ry processor and [C]ompressor, which covers a large set of XQuery queries in the compressed domain. We shred compressed XML into suitable data structures, aiming both at reducing memory usage at query time and at querying data while compressed. XQueC is the first system to take advantage of a query workload to choose the compression algorithms and to group the compressed data granules according to their common properties. By means of experiments, we show that good trade-offs between compression ratio and query capability can be achieved in several real cases, such as those covered by an XML benchmark. On average, XQueC improves over previous XML query-aware compression systems, while remaining reasonably close to general-purpose query-unaware XML compressors. Finally, query execution times for a wide variety of queries show that XQueC can reach speeds comparable to XQuery engines on uncompressed data.

Andrei Arion, Angela Bonifati, Gianni Costa, Sandra D’Aguanno, Ioana Manolescu, Andrea Pugliese
XQzip: Querying Compressed XML Using Structural Indexing

XML makes data flexible in representation and easily portable on the Web but it also substantially inflates data size as a consequence of using tags to describe data. Although many effective XML compressors, such as XMill, have been recently proposed to solve this data inflation problem, they do not address the problem of running queries on compressed XML data. More recently, some compressors have been proposed to query compressed XML data. However, the compression ratio of these compressors is usually worse than that of XMill and that of the generic compressor gzip, while their query performance and the expressive power of the query language they support are inadequate. In this paper, we propose XQzip, an XML compressor which supports querying compressed XML data by imposing an indexing structure, which we call Structure Index Tree (SIT), on XML data. XQzip addresses both the compression and query performance problems of existing XML compressors. We evaluate XQzip’s performance extensively on a wide spectrum of benchmark XML data sources. On average, XQzip is able to achieve a compression ratio 16.7% better and a querying time 12.84 times less than another known queriable XML compressor. In addition, XQzip supports a wide scope of XPath queries such as multiple, deeply nested predicates and aggregation.

James Cheng, Wilfred Ng
HOPI: An Efficient Connection Index for Complex XML Document Collections

In this paper we present HOPI, a new connection index for XML documents based on the concept of the 2-hop cover of a directed graph introduced by Cohen et al. In contrast to most of the prior work on XML indexing we consider not only paths with child or parent relationships between the nodes, but also provide space- and time-efficient reachability tests along the ancestor, descendant, and link axes to support path expressions with wildcards in our XXL search engine. We improve the theoretical concept of a 2-hop cover by developing scalable methods for index creation on very large XML data collections with long paths and extensive cross-linkage. Our experiments show substantial savings in the query performance of the HOPI index over previously proposed index structures in combination with low space requirements.

Ralf Schenkel, Anja Theobald, Gerhard Weikum
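
The 2-hop test itself is compact. In the sketch below the labels are written by hand for a three-node graph a -> h -> b; HOPI's contribution is computing small labels like these for large, heavily cross-linked XML collections:

    # Every node u stores Lout(u), the "hop" nodes it can reach, and
    # Lin(u), the hops that can reach it; u reaches v iff the sets meet.
    Lout = {"a": {"a", "h"}, "h": {"h"}, "b": {"b", "h"}}
    Lin  = {"a": {"a"}, "h": {"h"}, "b": {"b", "h"}}

    def reaches(u, v):
        return bool(Lout[u] & Lin[v])

    print(reaches("a", "b"))  # True: both label sets share the hop node "h"
    print(reaches("b", "a"))  # False: no common hop

The appeal over storing the full transitive closure is that a good cover keeps the label sets far smaller than the set of reachable nodes.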

Data and Information Management on the Web

Efficient Distributed Skylining for Web Information Systems

Though skyline queries have already claimed their place in retrieval over central databases, their application in Web information systems has so far been impossible due to the distributed nature of retrieval over Web sources. Yet, given the amount, variety, and volatile nature of information accessible over the Internet, extended query capabilities are crucial. We show how to perform distributed skyline queries efficiently and thus essentially extend the expressiveness of querying today’s Web information systems. Together with our innovative retrieval algorithm we also present useful heuristics to further speed up the retrieval in most practical cases, paving the way towards meeting even the real-time challenges of on-line information services. We discuss performance evaluations and point to open problems in the concept and application of skylining in modern information systems. For the curse of dimensionality, an intrinsic problem in skyline queries, we propose a novel sampling scheme that allows an early impression of the skyline to be obtained for subsequent query refinement.

Wolf-Tilo Balke, Ulrich Güntzer, Jason Xin Zheng
Query-Customized Rewriting and Deployment of DB-to-XML Mappings

Given the current trend towards application interoperability and XML-based data integration, there is an increasing need for XML interfaces to relational database management systems. In this paper we consider the problem of rewriting a DB-to-XML mapping into several modified mappings in order to support clients that require various portions of the mapping-defined data. Mapping rewriting has the effect of reducing the amount of shipped data and, potentially, query processing time at the client. We ship sufficient data to correctly answer the client queries. Various techniques to further limit the amount of shipped data are examined. We have conducted experiments to validate the usefulness of our shipped-data reduction techniques in the context of the TPC-W benchmark. The experiments confirm that in reasonable applications, data reduction is indeed significant (60-90%).

Oded Shmueli, George Mihaila, Sriram Padmanabhan
LexEQUAL: Supporting Multiscript Matching in Database Systems

To effectively support today’s global economy, database systems need to store and manipulate text data in multiple languages simultaneously. Current database systems do support the storage and management of multilingual data, but are not capable of querying or matching text data across different scripts. As a first step towards addressing this lacuna, we propose here a new query operator called LexEQUAL, which supports multiscript matching of proper names. The operator is implemented by first transforming matches in multiscript text space into matches in the equivalent phoneme space, and then using standard approximate matching techniques to compare these phoneme strings. The algorithm incorporates tunable parameters that impact the phonetic match quality and thereby determine the match performance in the multiscript space. We evaluate the performance of the LexEQUAL operator on a real multiscript names dataset and demonstrate that it is possible to simultaneously achieve good recall and precision by appropriate parameter settings. We also show that the operator run-time can be made extremely efficient by utilizing a combination of q-gram and database indexing techniques. Thus, we show that the LexEQUAL operator can complement the standard lexicographic operators, representing a first step towards achieving complete multilingual functionality in database systems.

A. Kumaran, Jayant R. Haritsa

Innovative Modelling Concepts for Spatial and Temporal Databases

A Model for Ternary Projective Relations between Regions

Current spatial database systems offer limited querying capabilities beyond topological relations. This paper introduces a model for projective relations between regions to support other qualitative spatial queries. The relations are ternary because they are based on the collinearity invariant of three points under projective geometry. The model is built on a partition of the plane into five regions that are obtained from projective properties of two reference objects; then, by considering the empty/non-empty intersections of a primary object with these five regions, the model is able to distinguish between 31 different projective relations.

Roland Billen, Eliseo Clementini
Computing and Handling Cardinal Direction Information

Qualitative spatial reasoning forms an important part of the commonsense reasoning required for building intelligent Geographical Information Systems (GIS). Previous research has come up with models to capture cardinal direction relations for typical GIS data. In this paper, we target the problem of efficiently computing the cardinal direction relations between regions that are composed of sets of polygons and present the first two algorithms for this task. The first of the proposed algorithms is purely qualitative and computes, in linear time, the cardinal direction relations between the input regions. The second has a quantitative aspect and computes, also in linear time, the cardinal direction relations with percentages between the input regions. The algorithms have been implemented and embedded in an actual system, CarDirect, that allows the user to annotate regions of interest in an image or a map, compute cardinal direction relations and retrieve combinations of interesting regions on the basis of a query.

Spiros Skiadopoulos, Christos Giannoukos, Panos Vassiliadis, Timos Sellis, Manolis Koubarakis
A Tale of Two Schemas: Creating a Temporal XML Schema from a Snapshot Schema with τXSchema

The W3C XML Schema recommendation defines the structure and data types for XML documents. XML Schema lacks explicit support for time-varying XML documents. Users have to resort to ad hoc, non-standard mechanisms to create schemas for time-varying XML documents. This paper presents a data model and architecture, called τXSchema, for creating a temporal schema from a non-temporal (snapshot) schema, a temporal annotation, and a physical annotation. The annotations specify which portion(s) of an XML document can vary over time, how the document can change, and where timestamps should be placed. The advantage of using annotations to denote the time-varying aspects is that logical and physical data independence for temporal schemas can be achieved while remaining fully compatible with both existing XML Schema documents and the XML Schema recommendation.

Faiz Currim, Sabah Currim, Curtis Dyreson, Richard T. Snodgrass

Query Processing Techniques for Spatial Databases

Spatial Queries in the Presence of Obstacles

Despite the existence of obstacles in many database applications, traditional spatial query processing utilizes the Euclidean distance metric assuming that points in space are directly reachable. In this paper, we study spatial queries in the presence of obstacles, where the obstructed distance between two points is defined as the length of the shortest path that connects them without crossing any obstacles. We propose efficient algorithms for the most important query types, namely, range search, nearest neighbors, e-distance joins and closest pairs, considering that both data objects and obstacles are indexed by R-trees. The effectiveness of the proposed solutions is verified through extensive experiments.

Jun Zhang, Dimitris Papadias, Kyriakos Mouratidis, Manli Zhu
NNH: Improving Performance of Nearest-Neighbor Searches Using Histograms

Efficient search for nearest neighbors (NN) is a fundamental problem arising in a large variety of applications of vast practical interest. In this paper we propose a novel technique, called NNH (“Nearest Neighbor Histograms”), which uses specific histogram structures to improve the performance of NN search algorithms. A primary feature of our proposal is that such histogram structures can co-exist in conjunction with a plethora of NN search algorithms without the need to substantially modify them. The main idea behind our proposal is to choose a small number of pivot objects in the space, and pre-calculate the distances to their nearest neighbors. We provide a complete specification of such histogram structures and show how to use the information they provide towards more effective searching. In particular, we show how to construct them, how to decide the number of pivots, how to choose pivot objects, how to incrementally maintain them under dynamic updates, and how to utilize them in conjunction with a variety of NN search algorithms to improve the performance of NN searches. Our intensive experiments show that nearest neighbor histograms can be efficiently constructed and maintained, and when used in conjunction with a variety of algorithms for NN search, they can improve the performance dramatically.

Liang Jin, Nick Koudas, Chen Li
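
A hedged reading of the histogram idea in code (the structure below keeps only one radius per pivot, where the paper keeps richer per-pivot information): by the triangle inequality, dist(q, p) + r_k(p) upper-bounds q's k-NN distance, and any candidate farther away than the best such bound can be discarded early.

    import math

    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    def build_histograms(data, pivots, k):
        hist = {}
        for p in pivots:
            radii = sorted(dist(p, o) for o in data)
            hist[p] = radii[k - 1]     # r_k(p): distance to p's k-th NN
        return hist

    def knn_bound(q, hist):
        # smallest triangle-inequality bound over all pivots: at least k
        # objects lie within dist(q, p) + r_k(p) of q, for each pivot p
        return min(dist(q, p) + r for p, r in hist.items())

    data = [(i, i % 7) for i in range(50)]
    hist = build_histograms(data, pivots=[(0, 0), (25, 3), (49, 6)], k=5)
    print(knn_bound((24, 2), hist))    # prune candidates beyond this radius

Because the bound is computed from a handful of precomputed radii, it can be bolted onto an existing NN search algorithm without modifying its internals, which matches the abstract's co-existence claim.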
Clustering Multidimensional Extended Objects to Speed Up Execution of Spatial Queries

We present a cost-based adaptive clustering method to improve average performance of spatial queries (intersection, containment, enclosure queries) over large collections of multidimensional extended objects (hyper-intervals or hyper-rectangles). Our object clustering strategy is based on a cost model which takes into account the spatial object distribution, the query distribution, and a set of database and system parameters affecting the query performance: object size, access time, transfer and verification costs. We also employ a new grouping criterion to group objects in clusters, more efficient than traditional approaches based on minimum bounding in all dimensions. Our cost model is flexible and can accommodate different storage scenarios: in-memory or disk-based. Experimental evaluations show that our approach is efficient in a number of situations involving large spatial databases with many dimensions.

Cristian-Augustin Saita, François Llirbat

Foundations of Query Processing

Processing Unions of Conjunctive Queries with Negation under Limited Access Patterns

We study the problem of answering queries over sources with limited access patterns. The problem is to decide whether a given query Q is feasible, i.e., equivalent to an executable query Q′ that observes the limited access patterns given by the sources. We characterize the complexity of deciding feasibility for the classes CQ¬ (conjunctive queries with negation) and UCQ¬ (unions of CQ¬ queries): Testing feasibility is just as hard as testing containment and therefore $\Pi^{P}_{2}$-complete. We also provide a uniform treatment for CQ, UCQ, CQ¬, and UCQ¬ by devising a single algorithm which is optimal for each of these classes. In addition, we show how one can often avoid the worst-case complexity by certain approximations: At compile-time, even if a query Q is not feasible, we can find efficiently the minimal executable query containing Q. For query answering at runtime, we devise an algorithm which may report complete answers even in the case of infeasible plans and which can indicate to the user the degree of completeness for certain incomplete answers.

Alan Nash, Bertram Ludäscher
Projection Pushing Revisited

The join operation, which combines tuples from multiple relations, is the most fundamental and, typically, the most expensive operation in database queries. The standard approach to join-query optimization is cost based, which requires developing a cost model, assigning an estimated cost to each query-processing plan, and searching in the space of all plans for a plan of minimal cost. Two other approaches can be found in the database-theory literature. The first approach, initially proposed by Chandra and Merlin, focused on minimizing the number of joins rather than on selecting an optimal join order. Unfortunately, this approach requires a homomorphism test, which itself is NP-complete, and has not been pursued in practical query processing. The second, more recent, approach focuses on structural properties of the query in order to find a project-join order that will minimize the size of intermediate results during query evaluation. For example, it is known that for Boolean project-join queries a project-join order can be found such that the arity of intermediate results is the treewidth of the join graph plus one. In this paper we pursue the structural-optimization approach, motivated by its success in the context of constraint satisfaction. We chose a setup in which the cost-based approach is rather ineffective; we generate project-join queries with a large number of relations over databases with small relations. We show that a standard SQL planner (we use PostgreSQL) spends an exponential amount of time on generating plans for such queries, with rather dismal results in terms of performance. We then show how structural techniques, including projection pushing and join reordering, can yield exponential improvements in query execution time. Finally, we combine early projection and join reordering in an implementation of the bucket-elimination method from constraint satisfaction to obtain another exponential improvement.

Benjamin J. McMahan, Guoqiang Pan, Patrick Porter, Moshe Y. Vardi
On Containment of Conjunctive Queries with Arithmetic Comparisons

We study the following problem: how to test whether Q2 is contained in Q1, where Q1 and Q2 are conjunctive queries with arithmetic comparisons? This problem is fundamental in a large variety of database applications. Existing algorithms first normalize the queries, then test a logical implication using multiple containment mappings from Q1 to Q2. We are interested in cases where the containment can be tested more efficiently. This work aims to (a) reduce the problem complexity from $\Pi^{P}_{2}$-completeness to NP-completeness in these cases; (b) utilize the advantages of the homomorphism property (i.e., the containment test is based on a single containment mapping) in applications such as those of answering queries using views; and (c) show that many real queries have the homomorphism property. The following are our results. (1) We show several cases where the normalization step is not needed, thus reducing the size of the queries and the number of containment mappings. (2) We develop an algorithm for checking various syntactic conditions on queries, under which the homomorphism property holds. (3) We further reduce the conditions of these classes using practical domain knowledge that is easily obtainable. (4) We conducted experiments on real queries, and show that most of the queries pass this test.

Foto Afrati, Chen Li, Prasenjit Mitra
XPath with Conditional Axis Relations

This paper is about the W3C standard node-addressing language for XML documents, called XPath. XPath is still under development: Version 2.0 appeared in 2001, while the theoretical foundations of Version 1.0 (dating from 1998) are still being widely studied. The paper aims at bringing XPath to a “stable fixed point” in its development: a version which is expressively complete, still manageable computationally, with a user-friendly syntax and a natural semantics. We focus on an important axis relation which is not expressible in XPath 1.0 and is very useful in practice: the conditional axis. With it we can express paths specified by, for instance, “do a child step while the test is true at the resulting node”. We study the effect of adding conditional axis relations to XPath on its expressive power and on the complexity of the query evaluation and query equivalence problems. We define an XPath dialect $\mathcal{X}$CPath which is expressively complete, has a linear time query evaluation algorithm, and for which query equivalence given a DTD can be decided in exponential time.

Maarten Marx

Advanced Query Processing and Optimization

Declustering Two-Dimensional Datasets over MEMS-Based Storage

Due to the large difference between seek time and transfer time in current disk technology, it is advantageous to perform large I/O using a single sequential access rather than multiple small random I/O accesses. However, prior optimal cost and data placement approaches for processing range queries over two-dimensional datasets do not consider this property. In particular, these techniques do not consider the issue of sequential data placement when multiple I/O blocks need to be retrieved from a single device. In this paper, we reevaluate the optimal cost of range queries by declustering two-dimensional datasets over multiple devices, and prove that, in general, it is impossible to achieve the new optimal cost. This is because disks cannot facilitate two-dimensional sequential access, which is required by the new optimal cost. Fortunately, MEMS-based storage is being developed to reduce I/O cost. We first show that the two-dimensional sequential access requirement cannot be satisfied by simply modeling MEMS-based storage as conventional disks. Then we propose a new placement scheme that exploits the physical properties of MEMS-based storage to solve this problem. Our theoretical analysis and experimental results show that the new scheme achieves almost optimal results.

Hailing Yu, Divyakant Agrawal, Amr El Abbadi
Self-tuning UDF Cost Modeling Using the Memory-Limited Quadtree

Query optimizers in object-relational database management systems require users to provide the execution cost models of user-defined functions (UDFs). Despite this need, however, there has been little work done to provide such a model. Furthermore, none of the existing work is self-tuning and, therefore, cannot adapt to changing UDF execution patterns. This paper addresses this problem by introducing a self-tuning cost modeling approach based on the quadtree. The quadtree has the inherent desirable properties of (1) performing fast retrievals, (2) allowing fast incremental updates (without storing individual data points), and (3) storing information at different resolutions. We take advantage of these properties of the quadtree and add the following in order to make the quadtree useful for UDF cost modeling: the abilities to (1) adapt to changing UDF execution patterns and (2) use limited memory. To this end, we have developed a novel technique we call the memory-limited quadtree (MLQ). In MLQ, each instance of UDF execution is mapped to a query point in a multi-dimensional space. Then, a prediction is made at the query point, and the actual value at the point is inserted as a new data point. The quadtree is then used to store summary information of the data points at different resolutions based on the distribution of the data points. This information is used to make predictions, guide the insertion of new data points, and guide the compression of the quadtree when the memory limit is reached. We have conducted extensive performance evaluations comparing MLQ with the existing (static) approach.

Zhen He, Byung S. Lee, Robert R. Snapp
Distributed Query Optimization by Query Trading

Large-scale distributed environments, where each node is completely autonomous and offers services to its peers through external communication, pose significant challenges to query processing and optimization. Autonomy is the main source of the problem, as it results in lack of knowledge about any particular node with respect to the information it can produce and its characteristics. Inter-node competition is another source of the problem, as it results in potentially inconsistent behavior of the nodes at different times. In this paper, inspired by e-commerce technology, we recognize queries (and query answers) as commodities and model query optimization as a trading negotiation process. Query parts (and their answers) are traded between nodes until deals are struck with some nodes for all of them. We identify the key parameters of this framework and suggest several potential alternatives for each one. Finally, we conclude with some experiments that demonstrate the scalability and performance characteristics of our approach compared to those of traditional query optimization.

Fragkiskos Pentaris, Yannis Ioannidis

Query Processing Techniques for Stream Data

Sketch-Based Multi-query Processing over Data Streams

Recent years have witnessed an increasing interest in designing algorithms for querying and analyzing streaming data (i.e., data that is seen only once in a fixed order) with only limited memory. Providing (perhaps approximate) answers to queries over such continuous data streams is a crucial requirement for many application environments; examples include large telecom and IP network installations where performance data from different parts of the network needs to be continuously collected and analyzed. Randomized techniques, based on computing small “sketch” synopses for each stream, have recently been shown to be a very effective tool for approximating the result of a single SQL query over streaming data tuples. In this paper, we investigate the problems arising when data-stream sketches are used to process multiple such queries concurrently. We demonstrate that, in the presence of multiple query expressions, intelligently sharing sketches among concurrent query evaluations can result in substantial improvements in the utilization of the available sketching space and the quality of the resulting approximation error guarantees. We provide necessary and sufficient conditions for multi-query sketch sharing that guarantee the correctness of the result-estimation process. We also prove that optimal sketch sharing typically gives rise to $\mathcal{NP}$-hard questions, and we propose novel heuristic algorithms for finding good sketch-sharing configurations in practice. Results from our experimental study with realistic workloads verify the effectiveness of our approach, clearly demonstrating the benefits of our sketch-sharing methodology.

Alin Dobra, Minos Garofalakis, Johannes Gehrke, Rajeev Rastogi
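
For background, here is a toy version of the sketching primitive being shared, assuming AMS-style sketches (the hash below is a stand-in for the four-wise independent families the literature uses); sharing means reusing one such sketch across several concurrent query estimates:

    import random

    def xi(seed, v):
        # pseudo-random {+1, -1} value for stream element v under 'seed'
        return 1 if random.Random(seed * 1_000_003 + hash(v)).random() < 0.5 else -1

    def sketch(stream, seed):
        return sum(xi(seed, v) for v in stream)

    def join_size_estimate(r, s, copies=200):
        # the product of two sketches built with the same seed is an
        # unbiased estimate of the join size; averaging independent copies
        # reduces variance (AMS proper uses a median of such averages)
        ests = [sketch(r, i) * sketch(s, i) for i in range(copies)]
        return sum(ests) / len(ests)

    R = [v for v in range(10) for _ in range(5)]   # each value appears 5 times
    S = [v for v in range(10) for _ in range(3)]   # each value appears 3 times
    print(join_size_estimate(R, S))                # true join size is 10*5*3 = 150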
Processing Data-Stream Join Aggregates Using Skimmed Sketches

There is a growing interest in on-line algorithms for analyzing and querying data streams that examine each stream element only once and have at their disposal only a limited amount of memory. Providing (perhaps approximate) answers to aggregate queries over such streams is a crucial requirement for many application environments; examples include large IP network installations where performance data from different parts of the network needs to be continuously collected and analyzed. In this paper, we present the skimmed-sketch algorithm for estimating the join size of two streams. (Our techniques also readily extend to other join-aggregate queries.) To the best of our knowledge, our skimmed-sketch technique is the first comprehensive join-size estimation algorithm to provide tight error guarantees while: (1) achieving the lower bound on the space required by any join-size estimation method in a streaming environment, (2) handling streams containing general update operations (inserts and deletes), (3) incurring a low logarithmic processing time per stream element, and (4) not assuming any a priori knowledge of the frequency distribution for domain values. Our skimmed-sketch technique achieves all of the above by first skimming the dense frequencies from random hash-sketch summaries of the two streams. It then computes the subjoin size involving only dense frequencies directly, and uses the skimmed sketches only to approximate subjoin sizes for the non-dense frequencies. Results from our experimental study with real-life as well as synthetic data streams indicate that our skimmed-sketch algorithm provides significantly more accurate estimates for join sizes compared to earlier sketch-based techniques.

Sumit Ganguly, Minos Garofalakis, Rajeev Rastogi
Joining Punctuated Streams

We focus on stream join optimization by exploiting the constraints that are dynamically embedded into data streams to signal the end of transmitting certain attribute values. These constraints are called punctuations. Our stream join operator, PJoin, is able to remove no-longer-useful data from the state in a timely manner based on punctuations, thus reducing memory overhead and improving the efficiency of probing. We equip PJoin with several alternate strategies for purging the state and for propagating punctuations to benefit down-stream operators. We also present an extensive experimental study to explore the performance gains achieved by purging state as well as the trade-off between different purge strategies. Our experimental results of comparing the performance of PJoin with XJoin, a stream join operator without a constraint-exploiting mechanism, show that PJoin significantly outperforms XJoin with regard to both memory overhead and throughput.

Luping Ding, Nishant Mehta, Elke A. Rundensteiner, George T. Heineman
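
A minimal sketch of punctuation-driven state purging in a symmetric hash join (the class, punctuation shape, and purge rule below are invented for illustration; PJoin's actual strategies are richer):

    from collections import defaultdict

    class PJoinSketch:
        def __init__(self):
            self.state = {0: defaultdict(list), 1: defaultdict(list)}

        def insert(self, side, key, tup):
            self.state[side][key].append(tup)
            return self.state[1 - side].get(key, [])   # probe the other side

        def punctuate(self, side, max_done_key):
            # 'side' promises no future tuples with key <= max_done_key, so
            # state on the *other* side kept only to match them can be dropped
            other = self.state[1 - side]
            for k in [k for k in other if k <= max_done_key]:
                del other[k]

    j = PJoinSketch()
    j.insert(0, 5, "r1")
    print(j.insert(1, 5, "s1"))    # ['r1'] joins with s1
    j.punctuate(0, 5)              # stream 0 is done with keys <= 5
    print(len(j.state[1][5]))      # 0: 's1' was purged by the punctuation

Without punctuations (as in XJoin), that state would be held until end of stream, which is where the memory and throughput gains reported in the abstract come from.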

Analysis and Validation Techniques for Data and Schemas

Using Convolution to Mine Obscure Periodic Patterns in One Pass

The mining of periodic patterns in time series databases is an interesting data mining problem that can be envisioned as a tool for forecasting and predicting the future behavior of time series data. Existing periodic pattern mining algorithms either assume that the periodic rate (or simply the period) is user-specified, or try to detect potential values for the period in a separate phase. The former assumption is a considerable disadvantage, especially in time series databases where the period is not known a priori. The latter approach results in a multi-pass algorithm, which on the other hand is to be avoided in online environments (e.g., data streams). In this paper, we develop an algorithm that mines periodic patterns in time series databases with unknown or obscure periods, such that discovering the period is part of the mining process. Based on convolution, our algorithm requires only one pass over a time series of length n, with O(n log n) time complexity.

Mohamed G. Elfeky, Walid G. Aref, Ahmed K. Elmagarmid
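
A worked instance of the convolution step, assuming FFT-based circular autocorrelation (a standard O(n log n) convolution; shown here only to make the period-detection idea concrete, not as the paper's algorithm):

    import numpy as np

    def candidate_periods(x, top=3):
        n = len(x)
        f = np.fft.rfft(x - x.mean())
        acf = np.fft.irfft(f * np.conj(f), n)       # circular autocorrelation
        lags = np.argsort(acf[1:n // 2])[::-1] + 1  # skip lag 0, rank the rest
        return lags[:top]

    x = np.tile([5.0, 1, 0, 2, 1, 0, 3, 1], 32)     # period-8 signal, n = 256
    print(candidate_periods(x))                     # 8 and its multiples dominate

The FFT computes all lags at once, which is what keeps period discovery inside a single pass instead of requiring a separate detection phase.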
CUBE File: A File Structure for Hierarchically Clustered OLAP Cubes

Hierarchical clustering has proved to be an effective means for physically organizing large fact tables, since it significantly reduces the I/O cost during ad hoc OLAP query evaluation. In this paper, we propose a novel multidimensional file structure for organizing the most detailed data of a cube, the CUBE File. The CUBE File achieves hierarchical clustering of the data, enabling fast access via hierarchical restrictions. Moreover, it imposes a low storage cost and adapts perfectly to the extensive sparseness of the data space, achieving a high compression rate. Our results show that the CUBE File outperforms the most effective method proposed up to now for hierarchically clustering the cube, resulting in 7-9 times fewer I/Os on average for all workloads tested. Thus, it achieves a higher degree of hierarchical clustering. Moreover, the CUBE File imposes a 2-3 times lower storage cost.

Nikos Karayannidis, Timos Sellis, Yannis Kouvaras
Efficient Schema-Based Revalidation of XML

As XML schemas evolve over time or as applications are integrated, it is sometimes necessary to validate an XML document known to conform to one schema with respect to another schema. More generally, XML documents known to conform to a schema may be modified, and then, require validation with respect to another schema. Recently, solutions have been proposed for incremental validation of XML documents. These solutions assume that the initial schema to which a document conforms and the final schema with which it must be validated after modifications are the same. Moreover, they assume that the input document may be preprocessed, which in certain situations, may be computationally and memory intensive. In this paper, we describe how knowledge of conformance to an XML Schema (or DTD) may be used to determine conformance to another XML Schema (or DTD) efficiently. We examine both the situation where an XML document is modified before it is to be revalidated and the situation where it is unmodified.

Mukund Raghavachari, Oded Shmueli

Multimedia and Quality-Aware Systems

Hierarchical In-Network Data Aggregation with Quality Guarantees

Earlier work has demonstrated the effectiveness of in-network data aggregation in order to minimize the amount of messages exchanged during continuous queries in large sensor networks. The key idea is to build an aggregation tree, in which parent nodes aggregate the values received from their children. Nevertheless, for large sensor networks with severe energy constraints the reduction obtained through the aggregation tree might not be sufficient. In this paper we extend prior work on in-network data aggregation to support approximate evaluation of queries to further reduce the number of exchanged messages among the nodes and extend the longevity of the network. A key ingredient to our framework is the notion of the residual mode of operation that is used to eliminate messages from sibling nodes when their cumulative change is small. We introduce a new algorithm, based on potential gains, which adaptively redistributes the error thresholds to those nodes that benefit the most and tries to minimize the total number of transmitted messages in the network. Our experiments demonstrate that our techniques significantly outperform previous approaches and reduce the network traffic by exploiting the super-imposed tree hierarchy.

Antonios Deligiannakis, Yannis Kotidis, Nick Roussopoulos
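
A minimal sketch of the threshold mechanism for a SUM query (the node model below is invented for illustration): a node forwards an update only when its value has drifted beyond its error budget since the last report.

    class SensorNode:
        def __init__(self, threshold):
            self.threshold = threshold
            self.reported = 0.0

        def maybe_report(self, current):
            if abs(current - self.reported) > self.threshold:
                delta = current - self.reported
                self.reported = current
                return delta        # message sent up the aggregation tree
            return None             # suppressed: still within the error budget

    node = SensorNode(threshold=2.0)
    for reading in [10.0, 10.5, 11.9, 13.1, 12.8]:
        print(reading, "->", node.maybe_report(reading))

The paper's adaptive algorithm then redistributes these per-node thresholds, giving larger budgets to the nodes whose suppression saves the most messages.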
Efficient Similarity Search for Hierarchical Data in Large Databases

Structured and semi-structured object representations are becoming more and more important for modern database applications. Examples of such data are hierarchical structures including chemical compounds, XML data, and image data. As a key feature, database systems have to support the search for similar objects, where it is important to take into account both the structure and the content features of the objects. A successful approach is to use the edit distance for tree-structured data. As the computation of this measure is NP-complete, constrained edit distances have been successfully applied to trees. While yielding good results, they are still computationally complex and, therefore, of limited benefit for searching in large databases. In this paper, we propose a filter-and-refinement architecture to overcome this problem. We present a set of new filter methods for structural and for content-based information in tree-structured data, as well as ways to flexibly combine different filter criteria. The efficiency of our methods, resulting from the good selectivity of the filters, is demonstrated in extensive experiments with real-world applications.

Karin Kailing, Hans-Peter Kriegel, Stefan Schönauer, Thomas Seidl
QuaSAQ: An Approach to Enabling End-to-End QoS for Multimedia Databases

This paper discusses the design and prototype implementation of a QoS-aware multimedia database system. Recent research in multimedia databases has devoted little attention to integrating QoS support at the user level. Our proposed architecture to enable end-to-end QoS control, the QoS-Aware Query Processor (QuaSAQ), satisfies user-specified quality requirements. Users need not be aware of detailed low-level QoS parameters, but rather specify high-level, qualitative attributes. In addition to an overview of key research issues in the design of QoS-aware databases, this paper presents our proposed solutions and system implementation details. An important issue relates to the enumeration and evaluation of alternative plans for servicing QoS-enhanced queries. This step follows the conventional query execution which results in the identification of objects of interest to the user. We propose a novel cost model for media delivery that explicitly takes the resource utilization of the plan and the current system contention level into account. Experiments run on the QuaSAQ prototype show significantly improved QoS and system throughput.

Yi-Cheng Tu, Sunil Prabhakar, Ahmed K. Elmagarmid, Radu Sion

Indexing Techniques

On Indexing Sliding Windows over Online Data Streams

We consider indexing sliding windows in main memory over on-line data streams. Our proposed data structures and query semantics are based on a division of the sliding window into sub-windows. By classifying windowed operators according to their method of execution, we motivate the need for two types of windowed indices: those which provide a list of attribute values and their counts for answering set-valued queries, and those which provide direct access to tuples for answering attribute-valued queries. We propose and evaluate indices for both of these cases and show that our techniques are more efficient than executing windowed queries without an index.

Lukasz Golab, Shaveen Garg, M. Tamer Özsu
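
A hedged sketch of the sub-window scheme for set-valued queries (the sub-window granularity and the count-based index below are illustrative):

    from collections import Counter, deque

    class WindowedCountIndex:
        def __init__(self, num_subwindows=2):
            # the window is a ring of sub-windows, each with its own counts
            self.subs = deque([Counter() for _ in range(num_subwindows)])

        def insert(self, value):
            self.subs[-1][value] += 1    # arrivals go to the newest sub-window

        def advance(self):
            self.subs.popleft()          # expire one whole sub-window at once
            self.subs.append(Counter())

        def count(self, value):          # set-valued query support
            return sum(sub[value] for sub in self.subs)

    idx = WindowedCountIndex(num_subwindows=2)
    for v in "abcab":
        idx.insert(v)
    print(idx.count("a"))    # 2
    idx.advance(); idx.advance()
    print(idx.count("a"))    # 0: that sub-window has slid out of the window

Expiring a whole sub-window at a time is the point: tuples never need individual deletion timestamps, so maintenance cost is amortized over the sub-window size.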
A Framework for Access Methods for Versioned Data

This paper presents a framework for understanding and constructing access methods for versioned data. Records are associated with version ranges in a version tree. A minimal representation for the end set of a version range is given. We show how, within a page, a compact representation of a record can be made using start version of the version range only. Current-version splits, version-and-key splits and consolidations are explained. These operations preserve an invariant which allows visiting only one page at each level of the access method when doing exact-match search (no backtracking). Splits and consolidations also enable efficient stabbing queries by clustering data alive at a given version into a small number of data pages. Last, we survey the methods in the literature to show in what ways they conform or do not conform to our framework. These methods include temporal access methods, branched versioning access methods and spatio-temporal access methods. Our contribution is not to create a new access method but to bring to light fundamental properties of version-splitting access methods and to provide a blueprint for future versioned access methods. In addition, we have not made the unrealistic assumption that transactions creating a new version make only one update, and have shown how to treat multiple updates.

Betty Salzberg, Linan Jiang, David Lomet, Manuel Barrena, Jing Shan, Evangelos Kanoulas
Management of Highly Dynamic Multidimensional Data in a Cluster of Workstations

Due to the proliferation and widespread use of mobile devices and satellite based sensors there has been increased interest in storing and managing spatio-temporal and sensory data. It has been recognized that centralized and monolithic index structures are not scalable enough to address the highly dynamic nature (high update rates) and the unpredictable access patterns in such datasets. In this paper, we propose an adaptive networked index method designed to address the above challenges. Our method not only facilitates fast query and update response times via dynamic data partitioning but is also able to self-tune highly loaded sites. Our contributions consist of techniques that offer dynamic load balancing of computing sites, non-disruptive on-the-fly addition/removal of storing sites, distributed collaborative decision making for the self-administering of the manager, and statistics-based data reorganization. These features are incorporated into a distributed software layer prototype used to evaluate the design choices made. Our experimentation compares the performance of a baseline configuration with our multi-site system, examines the attained speed-up as a function of the sites participating, investigates the effect of data reorganization on query/update response times, asserts the effectiveness of our proposed dynamic load balancing method, and examines the behavior of the system under diverse types of multi-dimensional data.

Vassil Kriakov, Alex Delis, George Kollios

Imprecise Information and Approximate Queries

Spatiotemporal Compression Techniques for Moving Point Objects

Moving-object data handling has received a fair share of attention over recent years in the spatial database community. This is understandable, as positioning technology is rapidly making its way into the consumer market, not only through the already ubiquitous cell phone but soon also through small on-board positioning devices in many means of transport and other types of portable equipment. It is thus to be expected that all these devices will start to generate an unprecedented stream of time-stamped positions. Sooner or later, such enormous volumes of data will lead to storage, transmission, computation, and display challenges; hence the need for compression techniques. Although some previous work has addressed compression of time-series data, it mainly deals with one-dimensional series and performs well only on short series in the absence of noise, two characteristics not met by moving-object data. We target applications in which present and past positions of objects are important, and so focus on the compression of moving-object trajectories. The paper applies some older line-generalization techniques and compares their performance against algorithms that we specifically designed for compressing moving-object trajectories.
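One classical line-generalization algorithm in this family is Douglas-Peucker; the sketch below is an illustrative reconstruction (not the authors' code) of a time-aware variant, where the error of dropping a point is its distance to the position interpolated between the kept endpoints at that point's own timestamp.

    # Illustrative: Douglas-Peucker-style compression with a synchronized
    # (timestamp-aware) error measure over (x, y, t) trajectory points.
    import math

    def sync_error(p, a, b):
        ratio = (p[2] - a[2]) / (b[2] - a[2]) if b[2] != a[2] else 0.0
        ix = a[0] + ratio * (b[0] - a[0])
        iy = a[1] + ratio * (b[1] - a[1])
        return math.hypot(p[0] - ix, p[1] - iy)

    def compress(traj, tol):
        if len(traj) <= 2:
            return list(traj)
        errs = [sync_error(p, traj[0], traj[-1]) for p in traj[1:-1]]
        worst = max(range(len(errs)), key=errs.__getitem__)
        if errs[worst] <= tol:
            return [traj[0], traj[-1]]  # all intermediate points dropped
        split = worst + 1               # keep the worst offender, recurse
        return compress(traj[:split + 1], tol)[:-1] + compress(traj[split:], tol)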

Nirvana Meratnia, Rolf A. de By
Non-contiguous Sequence Pattern Queries

Non-contiguous subsequence pattern queries search for symbol instances in a long sequence that satisfy some soft temporal constraints. In this paper, we propose a methodology that indexes long sequences, in order to efficiently process such queries. The sequence data are decomposed into tables and queries are evaluated as multiway joins between them. We describe non-blocking join operators and provide query preprocessing and optimization techniques that tighten the join predicates and suggest a good join order plan. As opposed to previous approaches, our method can efficiently handle a broader range of queries and can be easily supported by existing DBMS. Its efficiency is evaluated by experimentation on synthetic and real data.
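A minimal sketch of the decomposition-and-join idea (table layout and names are illustrative assumptions, not the paper's schema): the sequence is split into one occurrence table per symbol, and a pattern with soft gap constraints becomes a multiway join whose predicates bound the gaps between consecutive symbols.

    # Illustrative decomposition of a sequence into per-symbol tables,
    # evaluated as a multiway join with temporal gap predicates.
    from collections import defaultdict

    def build_tables(sequence):          # sequence: (symbol, timestamp) pairs
        tables = defaultdict(list)
        for sym, t in sequence:
            tables[sym].append(t)
        return tables

    def evaluate(tables, pattern, gap):  # gap = (lo, hi) between symbols
        lo, hi = gap
        results = [(t,) for t in tables[pattern[0]]]
        for sym in pattern[1:]:
            results = [r + (t,) for r in results for t in tables[sym]
                       if lo <= t - r[-1] <= hi]  # join predicate on the gap
        return results

    # e.g. evaluate(build_tables([("a", 1), ("b", 3), ("a", 4), ("b", 9)]),
    #               ["a", "b"], (1, 5)) yields [(1, 3), (4, 9)]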

Nikos Mamoulis, Man Lung Yiu

Industrial Papers

Mining Extremely Skewed Trading Anomalies

Trading surveillance systems screen and detect anomalous trades of equities, bonds, and mortgage certificates, among others, both to satisfy federal trading regulations and to prevent crimes such as insider trading and money laundering. Most existing trading surveillance systems are based on hand-coded expert rules. Such systems are known to entail a long development process and extremely high “false positive” rates. We participated in co-developing a data-mining-based automatic trading surveillance system for one of the biggest banks in the US. The challenge of this task is to handle a very skewed positive class (< 0.01%) in a very large volume of data (millions of records and hundreds of features). The combination of a very skewed distribution and a huge data volume poses a new challenge for data mining: previous work addresses these issues separately, and existing solutions are rather complicated and not straightforward to implement. In this paper, we propose a simple, systematic approach to mine “very skewed distributions in very large volumes of data”.
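The abstract does not spell out the algorithm, so the following is only a hedged illustration of one common recipe for extreme skew: keep every positive example, pair it with independent random samples of negatives, and average an ensemble of simple classifiers (scikit-learn used here for brevity).

    # Hedged illustration of ensemble learning under extreme class skew;
    # not the paper's own method.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def skewed_ensemble(X, y, n_models=10, neg_per_model=5000, seed=0):
        rng = np.random.default_rng(seed)
        pos = np.where(y == 1)[0]
        neg = np.where(y == 0)[0]
        models = []
        for _ in range(n_models):
            sample = rng.choice(neg, size=min(neg_per_model, len(neg)),
                                replace=False)
            idx = np.concatenate([pos, sample])  # all positives, sampled negatives
            models.append(DecisionTreeClassifier(max_depth=8)
                          .fit(X[idx], y[idx]))
        return models

    def score(models, X):
        # average the positive-class probability across ensemble members
        return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)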

Wei Fan, Philip S. Yu, Haixun Wang
Flexible Integration of Molecular-Biological Annotation Data: The GenMapper Approach

Molecular-biological annotation data is continuously being collected, curated and made accessible in numerous public data sources. Integration of this data is a major challenge in bioinformatics. We present the GenMapper system that physically integrates heterogeneous annotation data in a flexible way and supports large-scale analysis on the integrated data. It uses a generic data model to uniformly represent different kinds of annotations originating from different data sources. Existing associations between objects, which represent valuable biological knowledge, are explicitly utilized to drive data integration and combine annotation knowledge from different sources. To serve specific analysis needs, powerful operators are provided to derive tailored annotation views from the generic data representation. GenMapper is operational and has been successfully used for large-scale functional profiling of genes. Interactive access is provided under http://www.izbi.de.
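As a rough sketch of what such a generic representation can look like (the two-table schema below is an assumption for illustration, not GenMapper's actual data model), objects are tagged with their source and linked by explicit associations, and tailored annotation views are derived by joining along those associations.

    # Illustrative generic annotation model: objects plus associations.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE object (id INTEGER PRIMARY KEY, source TEXT, accession TEXT);
        CREATE TABLE association (src INTEGER, dst INTEGER, kind TEXT);
    """)
    db.executemany("INSERT INTO object VALUES (?, ?, ?)",
                   [(1, "LocusLink", "LL:4088"),
                    (2, "GeneOntology", "GO:0007165")])
    db.execute("INSERT INTO association VALUES (1, 2, 'annotated_with')")

    # derive a tailored annotation view: gene accessions -> GO terms
    view = db.execute("""
        SELECT g.accession, t.accession
        FROM object g
        JOIN association a ON a.src = g.id
        JOIN object t ON t.id = a.dst
        WHERE g.source = 'LocusLink' AND t.source = 'GeneOntology'
    """).fetchall()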

Hong-Hai Do, Erhard Rahm

Demo Papers

Meta-SQL: Towards Practical Meta-Querying

Enterprise databases often contain not only ordinary data, but also queries. Examples are view definitions in the system catalog; usage logs or workloads; and stored procedures as in SQL/PSM or Postgres. Unfortunately, these queries are typically stored as long strings, which makes it hard to use standard SQL to express meta-queries: queries about queries. Meta-querying is an important activity in situations such as advanced database administration, database usage monitoring, and workload analysis. Here are some examples of meta-queries to a usage log. (i) “Which queries in the log do the most joins?” (ii) “Which queries in the log return an empty answer on the current instance of the database?” (iii) View expansion: “Replace, in each query in the log, each view name by its definition as given in the system catalog.” (iv) Given a list of new view definitions (under the old names): “Which queries in the log give a different answer on the current instance under the new view definitions?”
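Meta-SQL itself operates on parsed query syntax; as a crude stand-in that needs no such support (a hedged sketch, not the system's mechanism), meta-query (i) can be approximated over the raw strings by counting the relations in each FROM clause.

    # Crude string-level approximation of meta-query (i): rank logged
    # queries by the number of joins in their FROM clauses.
    import re

    log = [
        "SELECT * FROM r, s WHERE r.a = s.a",
        "SELECT x FROM t",
        "SELECT * FROM r JOIN s ON r.a = s.a JOIN u ON s.b = u.b",
    ]

    def join_count(q):
        froms = re.findall(r"FROM\s+(.*?)(?:\s+WHERE|$)", q, re.I | re.S)
        relations = sum(len(re.split(r",|\bJOIN\b", f, flags=re.I))
                        for f in froms)
        return max(relations - 1, 0)  # n relations imply n - 1 joins

    for q in sorted(log, key=join_count, reverse=True):
        print(join_count(q), q)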

Jan Van den Bussche, Stijn Vansummeren, Gottfried Vossen
A Framework for Context-Aware Adaptable Web Services

The trend towards pervasive computing involves an increasing number of ubiquitous, connected devices. As a consequence, the heterogeneity of client capabilities and the number of methods for accessing information services on the Internet also increases. Nevertheless, consumers expect information services to be accessible from all of these devices in a similar fashion. They also expect that information services are aware of their current environment. Generally, this kind of information is called context. More precisely, in our work context constitutes information about consumers and their environment that may be used by Web services to provide consumers a customized and personalized behaviour.

Markus Keidl, Alfons Kemper
Aggregation of Continuous Monitoring Queries in Wireless Sensor Networking Systems

In this demo paper, we introduce our sensor query aggregation system, the Continuous Monitoring Query Evaluation System (CMQES), a test-bed for studying the aggregation problems of continuous monitoring queries (CMQ) on sensor data streams in a wireless sensor networking system. A CMQ monitors the data values from sensor nodes distributed in the system environment. The objective of CMQES is to provide a configurable test-bed for studying various important issues in the aggregation of CMQ. An aggregation scheme, Sequential Pushing (SeqPush), has been implemented. Various system settings and parameters can easily be configured with the interface provided in CMQES, and statistics such as the number of messages, the processing complexity, and the energy consumption rate at the sensor nodes are collected.
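The abstract names the scheme but not its internals; one plausible, hedged reading of sequential pushing is that a partial aggregate is forwarded hop by hop along a path of nodes, so each node sends a single message instead of reporting to the sink individually.

    # Hedged sketch of sequential pushing (details are not in the abstract).
    def seq_push(readings, combine, initial):
        partial, messages = initial, 0
        for value in readings:                # one sensor node per hop
            partial = combine(partial, value) # local combine at each node
            messages += 1                     # single message to the next node
        return partial, messages

    # e.g. a MAX monitoring query over a path of five sensor nodes:
    value, msgs = seq_push([17, 23, 19, 31, 22], max, float("-inf"))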

Kam-Yiu Lam, Henry C. W. Pang
eVitae: An Event-Based Electronic Chronicle

We present an event-based system for storing, managing, and presenting personal multimedia history. At the heart of our approach is information organization using the concept of an event. Events allow modeling of the data in a manner that is independent of media. To store events, we developed a novel database, EventBase, which is indexed by events. The unique characteristics of events make multidimensional querying and multiple-perspective exploration of personal history information feasible. In this demo we present the major functions of eVitae.

Bin Wu, Rahul Singh, Punit Gupta, Ramesh Jain
CAT: Correct Answers of Continuous Queries Using Triggers

Consider the query Q1: Retrieve all the motels which will be no further than 1.5 miles from my route, sometime between 7:00PM and 8:30PM, posed by a mobile user to a Moving Objects Database (MOD). Processing such queries is of interest to a wide range of applications (e.g. tourist information systems and context awareness [1,2]). These queries pertain to the future of a dynamic world. Since the MOD is only a model of the objects moving in the real world, the accuracy of the representation has to be continuously verified and updated, and the answer set of Q1 has to be re-evaluated at every clock tick. However, the re-evaluation of such queries can be avoided if an update to the MOD does not affect the answer set.
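A hedged sketch of such a trigger condition (the geometry and field names are invented for illustration): the answer set stands unless an update's predicted positions come within the query's distance bound during its time window.

    # Illustrative trigger guard: re-evaluate only when the update matters.
    def affects_answer(update, query):
        # update: (x, y, t) predicted positions; query: route points,
        # a distance bound, and a time window
        lo, hi = query["window"]
        for x, y, t in update:
            if lo <= t <= hi and any(
                    ((x - qx) ** 2 + (y - qy) ** 2) ** 0.5 <= query["dist"]
                    for qx, qy in query["route"]):
                return True
        return False

    q = {"route": [(0, 0), (5, 5)], "dist": 1.5, "window": (19.0, 20.5)}
    affects_answer([(4.5, 4.2, 19.7)], q)  # True: Q1 must be re-evaluated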

Goce Trajcevski, Peter Scheuermann, Ouri Wolfson, Nimesh Nedungadi
Hippo: A System for Computing Consistent Answers to a Class of SQL Queries

Integrity constraints express important properties of data, but the task of preserving data consistency is becoming increasingly problematic with new database applications. For example, in the case of integration of several data sources, even if the sources are separately consistent, the integrated data can violate the integrity constraints. The traditional approach, removing the conflicting data, is not a good option because the sources can be autonomous. Another scenario is a long-running activity where consistency can be violated only temporarily and future updates will restore it. Finally, data consistency may be neglected because of efficiency or other reasons.

Jan Chomicki, Jerzy Marcinkowski, Slawomir Staworko
An Implementation of P3P Using Database Technology

The privacy of personal information on the Internet has become a major concern for governments, businesses, media, and the public. Platform for Privacy Preferences (P3P), developed by the World Wide Web Consortium (W3C), is the most significant effort underway to enable web users to gain more control over their private information. P3P provides mechanisms for a web site to encode its data-collection and data-use practices in a standard XML format, known as a P3P policy [3], which can be programmatically checked against a user’s privacy preferences.
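As a hedged sketch of how database technology can serve here (the schema is an illustrative assumption, not the paper's actual design), a policy's data-collection statements can be shredded into a relation, and a user preference becomes a query that flags violating statements.

    # Illustrative relational check of a P3P-style policy.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE statement
                  (data_item TEXT, purpose TEXT, retention TEXT)""")
    db.executemany("INSERT INTO statement VALUES (?, ?, ?)", [
        ("user.home-info.telecom", "telemarketing", "indefinitely"),
        ("user.bdate",             "admin",         "stated-purpose"),
    ])

    # preference: reject any site using contact data for telemarketing
    violations = db.execute("""
        SELECT data_item FROM statement
        WHERE purpose = 'telemarketing'
          AND data_item LIKE 'user.home-info%'
    """).fetchall()
    accept = not violations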

Rakesh Agrawal, Jerry Kiernan, Ramakrishnan Srikant, Yirong Xu
XQBE: A Graphical Interface for XQuery Engines

XQuery is increasingly popular among computer scientists with a SQL background, since queries in XQuery and SQL require comparable skills to formulate. However, the number of these experts is limited, and the availability of easier XQuery “dialects” could be extremely valuable. Something similar happened with QBE, initially proposed as an alternative to SQL, which then became popular as the user-friendly query language supported by MS Access. We designed and implemented XQBE, a visual dialect of XQuery that uses hierarchical structures, coherent with the hierarchical nature of XML, to denote the input and output documents. Our demo consists of examples of queries in XQBE and shows how our prototype allows users to switch between equivalent representations of the same query.

Daniele Braga, Alessandro Campi, Stefano Ceri
P2P-DIET: One-Time and Continuous Queries in Super-Peer Networks

In peer-to-peer (P2P) systems a very large number of autonomous computing nodes (the peers) pool together their resources and rely on each other for data and services. P2P systems are application level virtual or overlay networks that have emerged as a natural way to share data and resources. The main application scenario considered in recent P2P data sharing systems is that of one-time querying: a user poses a query (e.g., “I want music by Moby”) and the system returns a list of pointers to matching files owned by various peers in the network. Then, the user can go ahead and download files of interest. The complementary scenario of selective dissemination of information (SDI) or selective information push is also very interesting. In an SDI scenario, a user posts a continuous query to the system to receive notifications whenever certain resources of interest appear in the system (e.g., when a song of Moby becomes available). SDI can be as useful as one-time querying in many target applications of P2P networks ranging from file sharing, to more advanced applications such as alert systems for digital libraries, e-commerce networks etc.

Stratos Idreos, Manolis Koubarakis, Christos Tryfonopoulos
HEAVEN: A Hierarchical Storage and Archive Environment for Multidimensional Array Database Management Systems

This paper presents HEAVEN, a solution for the intelligent management of large-scale datasets held on tertiary storage systems. We review the current state-of-the-art techniques for storage and retrieval of large spatio-temporal array data in the High Performance Computing (HPC) area. A major bottleneck today is fast and efficient access to, and evaluation of, high performance computing results. We address the necessity of developing techniques for efficient retrieval of requested subsets of large datasets from mass storage devices. Furthermore, we show the benefit of managing large spatio-temporal datasets, e.g. those generated by simulations of climate models, with Database Management Systems (DBMS). This means DBMS need a smart connection to tertiary storage systems with optimized access strategies. HEAVEN is based on the multidimensional array DBMS RasDaMan.

Bernd Reiner, Karl Hahn
OGSA-DQP: A Service for Distributed Querying on the Grid

OGSA-DQP is a distributed query processor exposed to users as an Open Grid Services Architecture (OGSA)-compliant Grid service. This service supports the compilation and evaluation of queries that combine data obtained from multiple services on the Grid, including Grid Database Services (GDSs) and computational web services. Not only does OGSA-DQP support integrated access to multiple Grid services, it is itself implemented as a collection of interacting Grid services. OGSA-DQP illustrates how Grid service orchestrations can be used to perform complex, data-intensive parallel computations. The OGSA-DQP prototype is downloadable from www.ogsadai.org.uk/dqp/. This demonstration aims to illustrate the capabilities of the OGSA-DQP prototype via a GUI client over a collection of bioinformatics databases and analysis tools.

M. Nedim Alpdemir, Arijit Mukherjee, Anastasios Gounaris, Norman W. Paton, Paul Watson, Alvaro A. A. Fernandes, Desmond J. Fitzgerald
T-Araneus: Management of Temporal Data-Intensive Web Sites

T-Araneus is a tool for the generation of Web sites with special attention to temporal aspects. It is based on the use of high-level models throughout the design process, and specifically on a logical model for temporal Web sites, T-ADM, which allows the definition of page-schemes with temporal aspects.

Paolo Atzeni, Pierluigi Del Nostro
τ-Synopses: A System for Run-Time Management of Remote Synopses

τ-Synopses is a system designed to provide a run-time environment for remote execution of various synopses. It enables easy registration of new synopses from remote platforms, after which the system can manage these synopses, including triggering their construction, rebuild and update, and invoking them for approximate query processing. The system captures and analyzes query workloads, enabling its registered synopses to significantly boost their effectiveness (efficiency, accuracy, confidence), by exploiting workload information for synopses construction and update. The system can also serve as a research platform for experimental evaluation and comparison of different synopses.

Yossi Matias, Leon Portman
AFFIC: A Foundation for Index Comparisons

Comparing indexes (UB-Tree, R*-Tree, etc.) on a structural and theoretical level provides upper bounds for their worst-case performance, but it does not reflect their performance in real-world applications. Measuring just query execution times is not fair either, since differences may be caused merely by bad implementations. Thus, a fair comparison has to take into account the algorithms, their implementation, real-world data and queries, data management, and I/O behavior as well as run time. Therefore, we need to ensure that the implementations of indexes rely on a common foundation and differ only where there are conceptual differences. Furthermore, the foundation should be simple and flexible in order to allow adding new indexes easily while providing full control over all relevant parts. In our demo we present a small and flexible C++ library fulfilling these requirements. We expect to release this library in 2004 and thus provide a UB-Tree implementation to the database community useful for further research.
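A language-neutral sketch of the design idea (the library itself is C++, and the class names below are invented): every index implements one interface over a shared page layer that also does the I/O accounting, so measured differences reflect conceptual differences only.

    # Illustrative common foundation for fair index comparisons.
    from abc import ABC, abstractmethod

    class CountingPageStore:
        """Shared page/IO layer that also counts I/Os for the comparison."""
        def __init__(self):
            self.reads = self.writes = 0
            self.pages = {}
        def read(self, pid):
            self.reads += 1
            return self.pages[pid]
        def write(self, pid, page):
            self.writes += 1
            self.pages[pid] = page

    class Index(ABC):
        def __init__(self, pagestore):
            self.pages = pagestore  # common foundation shared by all indexes

        @abstractmethod
        def insert(self, key, tid): ...

        @abstractmethod
        def range_query(self, low, high): ...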

Robert Widhopf
Spatial Data Server for Mobile Environment

In this paper we present a spatial data server that provides not only interoperability but also rapid response in mobile environments, through efficient techniques such as the use of main memory to eliminate disk access time on request, management of the target data (i.e., GML) in main memory to eliminate conversion time, a spatial index to reduce search time, and progressive transmission of spatial data to reduce response time.

Byoung-Woo Oh, Min-Soo Kim, Mi-Jeong Kim, Eun-Kyu Lee
Backmatter
Metadata
Title
Advances in Database Technology - EDBT 2004
edited by
Elisa Bertino
Stavros Christodoulakis
Dimitris Plexousakis
Vassilis Christophides
Manolis Koubarakis
Klemens Böhm
Elena Ferrari
Copyright year
2004
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-24741-8
Print ISBN
978-3-540-21200-3
DOI
https://doi.org/10.1007/b95855