Skip to main content

About this book

This volume comprises papers from the following ?ve workshops that were part of the complete program for the International Conference on Extending Database Technology (EDBT) held in Heraklion, Greece, March 2004: • ICDE/EDBT Joint Ph. D. Workshop (PhD) • Database Technologies for Handling XML-information on the Web (DataX) • Pervasive Information Management (PIM) • Peer-to-Peer Computing and Databases (P2P&DB) • Clustering Information Over the Web (ClustWeb) Together, the ?ve workshops featured 61 high-quality papers selected from appr- imately 180 submissions. It was, therefore, dif?cult to decide on the papers that were to beacceptedforpresentation. Webelievethattheacceptedpaperssubstantiallycontribute to their particular ?elds of research. The workshops were an excellent basis for intense and highly fruitful discussions. The quality and quantity of papers show that the areas of interest for the workshops are highly active. A large number of excellent researchers are working on the aforementioned ?elds producing research output that is not only of interest for other researchers but also for industry. The organizers and participants of the workshops were highly satis?ed with the output. The high quality of the presenters and workshop participants contributed to the success of each workshop. The amazing environment of Heraklion and the location of the EDBT conference also contributed to the overall success. Last, but not least, our sincere thanks to the conference organizers – the organizing team was always willing to help and if there were things that did not work, assistance was quickly available.

Table of Contents


Session I: March 18, 2004, Heraklion, Crete, Greece

Advanced Query Processing and Optimization

Querying Sliding Windows Over Online Data Streams

A data stream is a real-time, continuous, ordered sequence of items generated by sources such as sensor networks, Internet traffic flow, credit card transaction logs, and on-line financial tickers. Processing continuous queries over data streams introduces a number of research problems, one of which concerns evaluating queries over sliding windows defined on the inputs. In this paper, we describe our research on sliding window query processing, with an emphasis on query models and algebras, physical and logical optimization, efficient processing of multiple windowed queries, and generating approximate answers. We outline previous work in streaming query processing and sliding window algorithms, summarize our contributions to date, and identify directions for future work.

Lukasz Golab

MIRA: Multilingual Information Processing on Relational Architecture

In today’s global village, it is critical that the key information tools, such as web search engines,


portals and


, work across multiple natural languages, seamlessly. We propose a new flexible architecture –

Multilingual Information processing on Relational Architecture (MIRA)

– that supports the multilingual processing functionality of the primary storage mechanism for such deployments – the relational database systems, effectively and efficiently. We propose new linguistic matching operators that enhances the standard lexicographic matching of database systems into




domains. We further show that the performance of the systems may be made


. Our proposed architecture is based on standards and hence amenable for easy implementation in any type of query processing and information retrieval systems. In this paper, we present our approach to implement the above architecture and outline the host of research issues that are opened up due to the inherently fuzzy nature of the alternative matching semantics.

A. Kumaran

Semi-structured Data and XML

Index-Based Keyword Search in Mediator Systems

Many users and applications require the integration of semi-structured data from autonomous, heterogeneous Web sources. Over the last years mediator systems have emerged that use domain knowledge to overcome the problem of structural heterogeneity. However, many users of these systems do not have a thorough knowledge of the complex global schemas and of the comprehensive query languages. Consequently, easy-to-use query interfaces like keyword search and browsing have to be supported. The aim of the proposed PhD project is the index-based realization of keyword searches in concept-based mediator systems. In order to avoid unnecessary source queries an index structure is maintained on the global level and used during query planning and processing.

Ingolf Geist

Concept-Based Search on Semi-structured Data Exploiting Mined Semantic Relations

In this paper we show the current state of the ongoing research concerning our prototype for a search engine on semi-structured data incorporating rules mined on extracted structured data. We illuminate some ideas from the research field of data mining and how to apply them to the retrieval process. Additionally, we show technical aspects and features of our search engine.

Jens Graupmann

Data Model, Access Methods, and Control

Distributed and Scalable Similarity Searching in Metric Spaces

In this paper, we address the problem of scalable distributed similarity searching. Our work is based on single-site metric space indexing algorithms. They provide efficient way to perform range and nearest neighbor queries on arbitrary data in general metric spaces. The metric spaces are excellent abstraction that allows comparison of very complex objects (such as audio files, DNA sequences, texts). We have exploited the SDDS (Scalable and Distributed Data Structures) paradigms and P2P (Peer to Peer) systems to form a metric space similarity searching structure in distributed environment. Our proposed method is fully scalable without any centralized part and it allows performing similarity queries on stored data.

Michal Batko

Pattern Based Management: Data Models and Architectural Aspects

Patterns are concise, but rich in semantic, representation of data (data mining results, multimedia features, Web structural information, etc.). Only few approaches exist to deal with heterogeneous patterns, even if this is an hot topic in most data-intensive and distributed applications. The aim of this thesis is to investigate the potentiality of pattern-based management. In particular, we plan to define sufficiently expressive pattern data models and languages to cope with different types of patterns. Besides, investigation of architectural issues concerning the usage of the proposed model and languages in centralized and advanced distributed environments will be exploited.

Anna Maddalena


Managing Dynamic Repositories for Digital Content Components

The Semantic Web pursues interoperability at syntactic and semantic levels, to face the proliferation of data files with different purposes and representation formats. One challenge is how to represent such data, to allow users and applications to easily find, use and combine them. The paper proposes an infrastructure to meet those goals. The basis of the proposal is the notion of

digital content components

that extends the Software Engineering software component. The infrastructure offers tools to combine and extend these components, upon user request, managing them within dynamic repositories. The infrastructure adopt XML and RDF standards to foster interoperability, composition, adaptation and documentation of content data. This work was motivated by reuse needs observed in two specific application domains: education and agro-environmental planning.

André Santanchè, Claudia Bauzer Medeiros

Semantic Web Recommender Systems

Research on recommender systems has primarily addressed centralized scenarios and largely ignored open, decentralized systems where remote information distribution prevails. The absence of superordinate authorities having full access and control introduces some serious issues requiring novel approaches and methods. Hence, our primary objective targets the successful deployment and integration of recommender system facilities for Semantic Web applications, making use of novel technologies and concepts and incorporating them into one coherent framework.

Cai-Nicolas Ziegler

Session II: March 29, 2004, Boston, MA, USA

Information Hiding and Trust Negotiation

Trust Negotiation Systems

Trust negotiation is a promising approach for establishing trust in open systems like the Internet, where sensitive interactions may often occur between entities with no prior knowledge of each other. In this paper we present the work carried on so far in the area of trust negotiation as part of the Ph.D. activity. More precisely, we describe Trust-


, the XML based framework for trust negotiations we have developed, and emphasize the open issues to be covered.

Anna Cinzia Squicciarini

Advanced Query Processing and Optimization

Continuous Query Processing in Spatio-Temporal Databases

In this paper, we aim to develop a framework for continuous query processing in spatio-temporal databases. The proposed framework distinguishes itself from other query processors by employing two main paradigms: (1) Scalability in terms of the number of concurrent continuous spatio-temporal queries. (2) Incremental evaluation of continuous spatio-temporal queries. Scalability is achieved thorough employing a

shared execution

paradigm. Incremental evaluation is achieved through computing only the updates to the previously reported answer. We distinguish between two types of updates;


updates and


updates. Positive or negative updates indicate that a certain object should be added to or removed from the previously reported answer, respectively. The proposed framework is applicable to a wide variety of continuous spatio-temporal queries where we do not have any constraints about the mutability of objects and queries (i.e., both objects and queries can be either stationary or moving) or the movement representation (i.e., movement can be represented either by sampling or trajectory).

Mohamed F. Mokbel

Load Distribution for Distributed Stream Processing

Distributed steam processing is necessary for a large class of stream-based applications. To exploit the full power of distributed computation, effective load distribution techniques must be developed to optimize the system performance and cope with time-varying loads. When traditional load balancing or load sharing strategies are applied to such systems, we find that they either fall short in achieving good load distribution or fail to maintain good task partition in the long run.

In this paper, we study two important issues of dynamic load distribution in the context of data-intensive stream processing. The first one is how to allocate processing resources for push-based tasks such that the average end-to-end data processing latency can be minimized. The second issue is how to maintain a good load distribution dynamically for long running continuous queries. We propose a new hybrid load distribution strategy that addresses the above concerns by load clustering. To achieve scalability, our algorithm is completely decentralized and asynchronous.

Ying Xing

XML Query Processing and Optimization

In this paper, I summarize my research on optimizing XML queries. This work has two components: the first component is the definition of a logical algebra and logical optimization techniques. This algebra can be translated into different physical algebras such as native or extended-relational algebras. The second component is the design of physical storage structures and physical optimization techniques.

Ning Zhang

Data Models and Database Design

An Access Structure for Similarity Search in Metric Spaces

Similarity retrieval is an important paradigm for searching in environments where exact match has little meaning. Moreover, in order to enlarge the set of data types for which the similarity search can efficiently be performed, the mathematical notion of metric space provides a useful abstraction of similarity. In this paper, we present a novel access structure for similarity search in arbitrary metric spaces, called D-Index. D-Index supports easy insertions and deletions and bounded search costs for range queries with radius up to


. D-Index also supports disk memories, thus, it is able to deal with large archives. However, the partitioning principles employed in the D-Index are not very optimal since they produce high number of empty partitions. We propose several strategies of partitioning and, finally, compare them.

Vlastislav Dohnal

Incremental Read-Aheads

In spite of the advances in caching, query optimization, and object persistence techniques in the past few years, the cost of interactions of large-scale data-intensive applications with a relational database where the persistent objects are implemented remains a performance bottleneck. To reduce the cost of such interactions, we present a read-ahead scheme, which allows the application to reduce the number of database roundtrips by retrieving the data before it is actually needed by the transactions in the applications.

We focus on designing generic rules for determining the efficient sequences of SQL statements for read-ahead queries on relational databases, such that the rules would be useful across application domains and data-access patterns. This paper explains our research methodology for generating generic access patterns and studying the parameters that influence the costs of various combinations of read-ahead SQL statements that implement the generic access patterns of applications.

A. Soydan Bilgin

RAM: A Multidimensional Array DBMS

Application areas beyond (simple) administrative tasks are dominated by custom built solutions. Tasks like multimedia analysis require a view on data different from that offered by most database management systems: the set based data model may no longer suffice.

To address this issue we introduce RAM: a multidimensional array database system. The concept of an array database system is not new, however our approach differs from earlier work in that we realize this new view on data by mapping it onto a traditional relational schema. This approach unites the strengths of existing database systems with the added benefits of bulk array processing.

Alex R. van Ballegooij

Information Retrieval and Interoperability

Handling Inconsistencies in Data Warehouses

Data warehouses (DWs) can become inconsistent when some dimensional constraints are not satisfied by the dimension instances. In this paper, we present preliminary results about the effects of the violation of partitioning constraints in homogeneous dimension instances over aggregation queries, and in particular over the summarizability property (SUMM) of the DWs. We are interested in finding ways to retrieve consistent answers even when the DW is inconsistent. We give a notion of repair for inconsistent instances based on a notion of prioritized minimization. We also describe a notion of consistent answer in DWs.

Mónica Caniupán

Data Sharing and Querying for Peer-to-Peer Data Management Systems

In this work, we investigate mechanisms to support data sharing and querying in a

peer-to-peer data management system

, that is, a peer-to-peer system where each peer manages its own data. To support data sharing, we propose the use of mapping tables which list pairs of corresponding data values that reside in different peers. Our work illustrates how automated tools can help manage the tables between multiple peers by inferring new tables from existing ones and by checking their consistency. In terms of querying, we propose a framework in which users pose queries only with respect to their local peer. Then, we provide a rewriting mechanism that uses mapping tables to translate a locally expressed query to a set of queries over the acquainted peers.

Anastasios Kementsietsidis

Relevance Feedback in XML Retrieval

Highly heterogeneous XML data collections that do not have a global schema, as arising, for example, in federations of digital libraries or scientific data repositories, cannot be effectively queried with XQuery or XPath alone, but rather require a ranked retrieval approach. As known from ample work in the IR field, relevance feedback provided by the user that drives automatic query refinement or expansion can often lead to improved search result quality (e.g., precision or recall). In this paper we present a framework for feedback-driven XML query refinement and address several building blocks including reweighting of query conditions and ontology-based query expansion. We point out the issues that arise specifically in the XML context and cannot be simply addressed by straightforward use of traditional IR techniques, and we present our approaches towards tackling them.

Hanglin Pan

Database Technologies for Handling XML-Information on theWeb (DataX)

Invited Paper

XML Challenges for the Database Community: Past, Present, and Future

It is a common understanding that “the Web changes everything”, and this applies to an increasing number of computer applications. In the last years, we also understood that “XML is the means” for this change to take effect, as XML has progressively become the basis on which Web applications are built. This corollary was the claim of [9], published four years ago, in which XML was pointed out as a prominent and promising research direction for the database community.

Indeed, among the many promises that followed the advent of XML, some have been fulfilled, some are still at the stage of promise. After four years, one of the authors was asked to “redo” the same exercise, reconsidering the challenges posed to the database community and making another assessment of the area checking whether the expectations and challenges of the year 2000 have been met in 2004. This paper compares the past and present status of several XML-related research activities, examining the “next steps” that should lead to further integration of XML with database technology in terms of language and system requirements.

Daniele Braga, Alessandro Campi, Stefano Ceri

Indexing XML Documents

L-Tree: A Dynamic Labeling Structure for Ordered XML Data

With the ever growing use of XML as a data representation format, we see an increasing need for robust, high performance XML database systems. While most of the recent work focuses on efficient XML query processing, XML databases also need to support efficient updates. To speed up query processing, various labeling schemes have been proposed. However, the vast majority of these schemes have poor update performance. In this paper, we introduce a dynamic labeling structure for XML data: L-Tree and its order-preserving labeling scheme with O(log n) amortized update cost and O(log n) bits per label. L-Tree has good performance on updates without compromising the performance of query processing. We present the update algorithm for L-Tree and analyze its complexity.

Yi Chen, George Mihaila, Rajesh Bordawekar, Sriram Padmanabhan

Implementation of XPath Axes in the Multi-dimensional Approach to Indexing XML Data

XML (Extensible Mark-up Language) has been recently understood as a new approach to data modelling. An implementation of a system enabling us to store and query XML documents efficiently requires the development of new techniques which make it possible to index an XML document in a way that provides an efficient evaluation of a user query. Most XML query languages are based on the language XPath and use a form of path expressions for composing more general queries. XPath defines a family of 13 axes, i.e. relationship types in which an actual element can be associated to other elements in the XML tree. Previously published multi-dimensional approaches to indexing XML data use paged and balanced multi-dimensional data structures like UB-trees and R


-trees. In this paper we revise the approaches and introduce a novel approach to the implementation of an XPath subset.

Michal Krátký, Jaroslav Pokorný, Václav Snášel

Dynamic Range Labeling for XML Trees

Structural joins based on range labeling schemes are considered as one of the most important topics in studies on XML query processing. When an XML data set is updated, however, the nodes have to be relabeled in order to keep their order relationship. Costly bulk node relabeling should be avoided to allow for continuous processing of queries for dynamic XML trees that are updated often. In this paper, we propose two dynamic node labeling schemes to avoid “gap shortfalls”. One is simple local relabeling scheme and the other is more sophisticated in that it uses approximate histograms that keep the statistics of the update operations. These two techniques allow node labels to be managed dynamically and locally. Experiments show that they can avoid bulk relabeling while still permitting update operations.

Takeharu Eda, Yasushi Sakurai, Toshiyuki Amagasa, Masatoshi Yoshikawa, Shunsuke Uemura, Takashi Honishi

FliX: A Flexible Framework for Indexing Complex XML Document Collections

While there are many proposals for path indexes on XML documents, none of them is perfectly suited for indexing large-scale collections of interlinked XML documents. Existing strategies lack support for links, require large amounts of time to build or space to store the index, or cannot efficiently answer connection queries. This paper presents the


framework for connection indexing that supports large, heterogeneous document collections with links, using the existing path indexes as building blocks. We introduce some example configurations of the framework that are appropriate for many important application scenarios. Experiments show the feasibility of our approach.

Ralf Schenkel

Querying XML Documents and Schema Evolution

A Statistical Approach for XML Query Size Estimation

The increasing number of XML repositories has intensified research activities in the optimization of XML queries. The success of any optimization approach hinges on an accurate query size estimation. This paper presents a statistical method for estimating the result size of XML queries. Our estimation system extracts two summarized information, namely, node ratio and node factor, from every distinct parent-child path in the XML files. Experiment results indicate that our approach requires small memory footprint, and yet proves to be sufficient in estimating the result size of queries under the data-independent assumption.

Mong Li Lee, Hanyu Li, Wynne Hsu, Beng Chin Ooi

Summarizing XML Data by Means of Association Rules

XML is a rather verbose representation of semistructured data, which may require huge amounts of storage space. We propose several summarized representations of XML data, which can both provide succinct information and be directly queried. These representations are based on the extraction of association rules from XML datasets.

Elena Baralis, Paolo Garza, Elisa Quintarelli, Letizia Tanca

Prune XML Before You Search It: XML Transformations for Query Optimization

In this paper, we present a query optimization approach for XML document management environments loosely-coupled with the storage system. Our technique is based on two main steps: first, based on the query, input documents are transformed, by pruning parts that are irrelevant with respect to the query; then the query is executed on the pruned documents. An index structure is also provided to further optimize the pruning process. Experimental results show that, by using our pruning strategy, query execution time can be significantly reduced.

Stéphane Bressan, Zoé Lacroix, Ying Guang Li, Anna Maddalena

Keeping Pace with Evolving XML-Based Specifications

Standards and specifications are essential to organizations that require exchange of information with other parties. XML is poised to be the definitive language of choice in Internet and Intranet applications and IT specifications are increasingly adopting XML schema. However, many specifications are still evolving. In order to cope with such rapid changes, extensions to the XML Schema Language are proposed. The objective is to ensure compatibility between different versions of standards. A framework which utilizes the XML Schema Language extensions is further proposed.

Marvin Tan, Angela Goh

XML Application

Knowledge Management Framework for the Collaborative Distribution of Information

The management of multimedia documents, which includes storage, indexing, query optimization, and distribution, requires very strong frameworks and policies in order to ensure its efficiency and reliability. There has been much work done in the fields of databases, information retrieval, relevance evaluation, and transaction processing for any kind of data; but these four domains are too often considered separately. This definitely causes a lack of comprehensive distributed information management. We are convinced that much benefit can be obtained by apprehending a global vision of the whole management process through XML. In fact, we want to provide technologies that improve the access to heterogeneous and distributed sources of information for people sharing common interests. In order to deal with detailed and consistent information, we have to consider several layers of metadata related to users, communities, devices, and data sources. This categorized information has thereafter to be handled and efficiently used by applications combining processes from the four operative domains. Then, we claim that it is possible to offer users an appropriate viewpoint of the data, i.e. a personalized, effective, and very accurate access to the information.

Jérôme Godard, Frédéric Andrès, William Grosky, Kinji Ono

XML-Based Revocation and Delegation in a Distributed Environment

The rapid increase on the circulation of data over the web has highlighted the need for distributed storage of Internet-accessible information due to the rapid increase on the circulation of data over the web. Thus, access control mechanisms should also be distributed in order to protect them effectively. A recent idea in the access control theory is the delegation and revocation of rights, i.e. the passing over of one clients rights to the other and vice versa. Here, we propose an XML-based distributed delegation module which can be integrated into a distributed role-based access control mechanism protecting networks. The idea of X.509v3 certificates is used for the transfer of authorization information referring to a client. The modules are XML-based and all of the associated data structures are expressed through Document Type Definitions (DTDs).

Konstantina Stoupa, Athena Vakali, Fang Li, Ioannis Tsoukalas

Using a Transcode and Prefetch Method for Playing XML Contents Containing Multiple Multimedia Data on Mobile Terminals

With the recent development in mobile communication, Internet connection via mobile terminals became possible. Therefore, in order for data exchange between mutually different terminal types, XML-based contents are widely being used. However, there are limitations in using mobile terminals to achieve real-time playback of XML-based contents including various multimedia data. Among them, two main limitations are discussed in this study. The first limitation is the utilization of low communication bandwidth in mobile terminals. That is, the bandwidth of mobile terminals is too low (24kbps ~ 1.2Mbps) to accommodate the high bandwidth required in multimedia data. The second limitation is the restricted memory space in mobile terminals. Playback of XML-based contents in such mobile terminals becomes restrictive when utilizing only the existing transcoding and prefetch technique. Therefore, this study attempts at proposing the method of minimizing the initial delay time prior to playback when XML contents to be displayed in mobile terminals are consisted of a number of data. In order for this, a stream selection policy was proposed to transcode data streams included in XML contents, rather than to transcode all of the data streams. In addition, a technique has been suggested to reorganize the transcoded data streams, to evaluate whether it is possible to playback the reorganized data streams in the mobile terminal, thus to prefetch the data.

Maria Hong, Dong-Yeop Ryu, Jae-Chul Sir, Eun-Young Kim, Young-Hwan Lim


What’s Next in XML and Databases?

Since the time XML became a W3C standard for document representation and exchange over the Web, many efforts have been devoted to the development of standards, methodologies, and tools for handling, storing, retrieving, and protecting XML documents. The purpose of this panel, held during the international EDBT’2004 workshop on “

database technologies for handling XML information on the Web

” [3], is to discuss the current status of the research in XML data management and to foresee new trends towards the XML-ization of database research.

Minos Garofalakis, Ioana Manolescu, Marco Mesiti, George Mihaila, Ralf Schenkel, Bhavani Thuraisingham, Vasilis Vassalos

Pervasive Information Management (PIM)

Modeling Contextual Information Using Active Data Structures

In this paper, we present some ideas how to represent, handle, and integrate semantically expressive contextual information. Our approach enhances topic maps by more topic and association types and embedded code. Our data model is a comprehensive picture combining the knowledge of many context-aware applications. It is capable of performing automatic updates of the contained contextual information, computes higher-level information by itself, and routes context events to client applications. We present a context-aware front end of a car navigation system that uses our data model.

Kevin Goslar, Alexander Schill

Context- and Situation-Awareness in Information Logistics

In order to deliver relevant information at the right time to its mobile users, systems such as event notification systems need to be aware of the users’ context, which includes the current time, their location, or the devices they use. Many context frameworks have been introduced in the past few years. However, they usually do not consider the notion of characteristic features of contexts that are invariant during certain time intervals. Knowing the current situation of a user allows the system to better target the information to be delivered. This paper presents a model to handle various contexts and situations in information logistics. A context is defined as a collection of values usually observed by sensors, eg., location or temperature. A situation builds on this concept by introducing semantical aspects defined in an ontology. Our

situation awareness

proposal has been tested in two projects.

Ulrich Meissen, Stefan Pfennigschmidt, Agnès Voisard, Tjark Wahnfried

An Indexing Scheme for Update Notification in Large Mobile Information Systems

Due to the increasing usage of small and low footprinted devices like mobile phones as clients of mobile information systems a new problem arises: “How to determine the relevance of updates for a large number of mobile clients?” In this paper we present an indexing scheme that represents conjunctive queries posed by the mobile clients in a trie. So, IDs of the clients are referenced by their queries and checking the relevance of an update can be efficiently done by a trie lookup.

Hagen Höpfner, Stephan Schosser, Kai-Uwe Sattler

A Mobile Agents Based Architecture for the Distributed Processing of Continuous Location Queries in a Wireless Environment: Performance Evaluation

With the current advances of mobile computing technology, we are witnessing an explosion in the development of applications that provide mobile users with a wide range of services. Of special interest are those applications that exploit the particular features of mobile environments to provide the user with context-aware information. In particular, we focus on location-dependent queries, which are still a subject of research mainly due to the lack of an architecture that is well-adapted to deal with continuous location queries in an efficient way.

In this paper we present the distributed architecture, based on mobile agents, that we propose to process continuous location-dependent queries in mobile environments. We then evaluate our proposal, showing that the system achieves a good precision and scales up well.

Sergio Ilarri, Eduardo Mena, Arantza Illarramendi

Improving Usability of Location-Based Services with User-Centric Data Querying

In this paper, we describe a computational model that produces a dataset which we consider to be an appropriate response for “give me a description of my neighborhood” type of queries in location-based services. Our attempt is to reflect the abstraction ability of a specific user, and in this way to maximize the usability of the service under restrictions on data volume posed by technical, economical and cognitive factors of mobile environments. We also present the results of instantiating the model for a simple LBS.

Artem Katasonov

MITOS: A Smart Spaces System for Pervasive Computing

The popularity of wireless networks has increased in the recent years, as they become a common addition to fixed LAN infrastructures. In this paper we introduce a novel use for a wireless network based on the IEEE 802.11 standard: a smart spaces system that proposes to wireless users to assume optimum locations in order to obtain a satisfactory QoS level when the wireless hot-spot becomes saturated. The system, being kept up to date with the traffic across the network and the location of each user, is capable of making the appropriate proposal urging users to move to new locations at reasonable distances. In that way, user perceived QoS and wireless network load-balancing can be both achieved. MITOS is the name of the developed smart spaces system. We discuss the general system architecture and operation as well as the optimum proposal algorithm. The operation of the system was verified in the wireless infrastructure of our department.

George Alyfantis, Stathes Hadjiefthymiades, Lazaros Merakos

Service Allocation in Selfish Mobile Ad hoc Networks Using Vickrey Auction

Incentive scheme for stimulating service provision in Mobile Ad hoc NETworks (MANET) has been under intensive investigation due to its significance to the operation of MANET. This paper applies distributed algorithmic mechanism design and utilizes Vickrey auction for service allocation in mobile ad hoc networks. We show that our method stimulates service provision and achieves desired system-wide service allocation in spite of each agent’s selfish behavior, while introducing challenges from the inherent shortcomings of Vickrey auction and characteristics of MANET. We discuss the challenges, the existing solutions for wireline networks and propose a system model for service allocation in MANET.

Jinshan Liu, Valérie Issarny

Engineering Incentive Schemes for Ad hoc Networks

A Case Study for the Lanes Overlay

In ad hoc networks, devices have to cooperate in order to compensate for the absence of infrastructure. Yet, autonomous devices tend to abstain from cooperation in order to save their own resources. Incentive schemes have been proposed as a means of fostering cooperation under these circumstances. In order to work effectively, incentive schemes need to be carefully tailored to the characteristics of the cooperation protocol they should support. This is a complex and demanding task. However, up to now, engineers are given virtually no help in designing an incentive scheme. Even worse, there exists no systematic investigation into which characteristics should be taken into account and what they imply. Therefore, in this paper, we propose a systematic approach for the engineering of incentive schemes. The suggested procedure comprises the analysis and adjustment of the cooperation protocol, the choice of appropriate incentives for cooperation, and guidelines for the evaluation of the incentive scheme. Finally, we show how the proposed procedure is successfully applied to a service discovery overlay.

Philipp Obreiter, Birgitta König-Ries, Georgios Papadopoulos

The Case for Mobile OLAP

On-Line Analytical Processing (OLAP) is a trend in database technology, based on the multidimensional view of data. Numerous applications and development environments exist, offering OLAP analysis to developers, analysts and end users. Further more, mobile devices find their way into almost all aspects of every day life, ranging from cellular phones with advanced capabilities (smart phones) for personal use to professional PDA’s used for numerous wireless applications – OLAP included – and thus generating the need for convergence of the two technologies. In this paper we introduce the term

Mobile OLAP

to express the porting of requirements and specifications for OLAP applications into the wireless and mobile computing world. More specifically, we examine the requirements posed for OLAP visualization applications on mobile devices, both in terms of presentation and usability. We then proceed by investigating the state of the art of the broader wireless computing community, focusing in the specifications more close to OLAP. Finally, we demonstrate how a “lightweight” presentation model for OLAP can significantly contribute towards mobile OLAP.

Andreas S. Maniatis

Peer-to-Peer Computing and Databases (P2P&DB)


On Constructing Small Worlds in Unstructured Peer-to-Peer Systems

Peer-to-peer systems have evolved as a means to share large amounts of data among autonomous nodes. A central issue in this context is locating nodes with data matching a user query. In this paper, we consider building peer-to-peer systems with small-world properties, that is, connecting the nodes to each other so that: (i) the distance between any two nodes is small and (ii) relevant nodes are connected to each other. Relevance between nodes is defined based on the probability that the two nodes match similar queries. We propose decentralized procedures for constructing small worlds based on routing indexes that describe the content of neighboring nodes. Our experimental results show that small-world peer-to-peer systems built with these procedures increase recall, that is, the percentage of relevant results returned.

Yannis Petrakis, Evaggelia Pitoura

SPROUT: P2P Routing with Social Networks

In this paper, we investigate how existing social networks can benefit P2P data networks by leveraging the inherent trust associated with social links. We present a trust model that lets us compare routing algorithms for P2P networks overlaying social networks.We propose SPROUT, a DHT routing algorithm that, by using social links, significantly increases the number of query results and reduces query delays.We discuss further optimization and design choices for both the model and the routing algorithm. Finally, we evaluate our model versus regular DHT routing and Gnutella-like flooding.

Sergio Marti, Prasanna Ganesan, Hector Garcia-Molina

A Simulation Framework for Schema-Based Query Routing in P2P-Networks

Current simulations of P2P-networks don’t take any kind of schemas into account. We present a simulation-framework and first results for query routing based on extensible schema information to describe peer content, providing more value than simple categorizations like the filename as abstraction for an MP3-song. Using different parameterization, we compare the impact of introducing the HyperCuP-topology in a P2P-network for routing and possible clustering in super-peers and discuss first simulation results.

Wolf Siberski, Uwe Thaden

Query Evaluation

A Distributed Algorithm for Robust Data Sharing and Updates in P2P Database Networks

In this paper we thoroughly analyze a distributed procedure for the problem of local database update in a network of database peers, useful for data exchange scenarios. The algorithm supports dynamic networks: even if nodes and coordination rules appear or disappear during the computation, the proposed algorithm will eventually terminate with a sound and complete result.

Enrico Franconi, Gabriel Kuper, Andrei Lopatenko, Ilya Zaihrayeu

XPeer : A Self-Organizing XML P2P Database System

This paper describes XPeer , a


system for sharing and querying XML data. The system allows users to share XML data without significant human intervention, and to pose XQuery FLWR queries against them. The proposed system can be used in any application field, being a general purpose XML p2p DBMS, even though its main application is the management of resource descriptions in



Carlo Sartiani, Paolo Manghi, Giorgio Ghelli, Giovanni Conforti

Bit Zipper Rendezvous Optimal Data Placement for General P2P Queries

In many distributed applications, pairs of queries and values are evaluated by participating nodes. This includes keyword search for documents, selection queries on tuples, and publish-subscribe. These applications require that all values accepted by the query be evaluated. To carry out this evaluation we will present the peer-to-peer based Bit Zipper Rendezvous which partitions query-value pairs as opposed to values only. Even for problems that allow an efficient value-based partition, the Bit Zipper complements existing solutions with its generality. Where flooding to


nodes used to be the only fall-back, the Bit Zipper is a replacement needing only


. For problems requiring that all pairs be evaluated, we will show that the Bit Zipper Rendezvous is optimal.

Wesley W. Terpstra, Stefan Behnel, Ludger Fiege, Jussi Kangasharju, Alejandro Buchmann

Query Answering in Peer-to-Peer Data Exchange Systems

The problem of answering queries posed to a peer who is a member of a peer-to-peer data exchange system is studied. The answers have to be consistent wrt to both the local semantic constraints and the data exchange constraints with other peers; and must also respect certain trust relationships between peers. A semantics for

peer consistent answers

under exchange constraints and trust relationships is introduced and some techniques for obtaining those answers are presented.

Leopoldo Bertossi, Loreto Bravo

Semantic Query Routing and Processing in P2P Database Systems: The ICS-FORTH SQPeer Middleware

Peer-to-peer (P2P) computing is currently attracting enormous attention. In P2P systems a very large number of autonomous computing nodes (the peers) pool together their resources and rely on each other for data and services. More and more P2P data management systems rely nowadays on intensional (i.e. schema) information for integrating and querying peer bases. Such information can be easily captured by emerging Semantic Web languages such as RDF/S. However, a fully-fledged framework for evaluating semantic queries over peer RDF/S bases (materialized or virtual) is missing. In this paper we present the ICS-FORTH SQPeer middleware for routing and processing RQL queries and RVL views. The novelty of SQPeer lies on the use of intensional active schemas for determining relevant peer bases, as well as, constructing distributed query plans. In this context, we consider optimization opportunities for SQPeer query plans.

George Kokkinidis, Vassilis Christophides

Query Processing in Super-Peer Networks with Languages Based on Information Retrieval: The P2P-DIET Approach

This paper presents P2P-DIET, an implemented resource sharing system that unifies one-time and continuous query processing in super-peer networks. P2P-DIET offers a simple data model for the description of networkresources based on attributes with values of type text and a query language based on concepts from Information Retrieval. The focus of this paper is on the main modelling concepts of P2P-DIET (metadata, advertisements and queries), the routing algorithms (inspired by the publish/subscibe system SIENA) and the scalable indexing of resource metadata and queries.

Stratos Idreos, Christos Tryfonopoulos, Manolis Koubarakis, Yannis Drougas

Data Structures

A Query-Adaptive Partial Distributed Hash Table for Peer-to-Peer Systems

The two main approaches to find data in


(P2P) systems are


networks using flooding and


networks using a distributed index. A distributed index is usually built over

all keys

that are stored in the network whether they are queried or not. Indexing all keys is no longer feasible when indexing


, as the key space becomes very large. Here we need a query-adaptive approach that indexes only keys worth indexing, i.e. keys that are queried at least with a certain frequency. In this paper we study the cost of indexing and propose a query-adaptive

partial distributed hash table

(PDHT) that does not keep all keys in the index. We model and analyze a scenario to show that query-adaptive partial indexing outperforms pure flooding and “index-everything” strategies. Furthermore, our scheme is able to automatically adjust the index to changing query frequencies and distributions.

Fabius Klemm, Anwitaman Datta, Karl Aberer

P2PR-Tree: An R-Tree-Based Spatial Index for Peer-to-Peer Environments

The unprecedented growth and increased importance of geographically distributed spatial data has created a strong need for efficient sharing of such data. Interestingly, the ever-increasing popularity of peer-to-peer (P2P) systems has opened exciting possibilities for such sharing. This motivates our investigation into spatial indexing in P2P systems. While much work has been done towards expediting search in file-sharing P2P systems, issues concerning spatial indexing in P2P systems are significantly more complicated due to overlaps between spatial objects and the complexity of spatial queries. Incidentally, existing R-tree-based structures for distributed environments (e.g., the MC-Rtree) are


adequate for addressing the sheer scale, dynamism and heterogeneity of P2P environments. Hence, we propose the P2PR-tree (Peer-to-Peer R-tree), which is a new spatial index specifically designed for P2P systems. The main features of P2PR-tree are two-fold. First, it is hierarchical and performs efficient pruning of the search space by maintaining minimal amount of information concerning peers that are far away and storing more information concerning nearby peers, thereby optimizing disk space usage. Second, it is completely decentralized, scalable and robust to peers joining/leaving the system. The results of our performance evaluation demonstrate that it is indeed practically feasible to share spatial data in a P2P system and that P2PR-tree is able to outperform MC-Rtree significantly.

Anirban Mondal, Yi Lifu, Masaru Kitsuregawa

Clustering Information Over theWeb (ClustWeb)

Invited Paper

Web Data Protection: Principles and Research Issues

Protection of web documents is a challenging task which requires to address several issues, ranging from the development of suitable policy languages to policy enforcement. In this paper, we discuss the main problems that need to be faced in providing a comprehensive framework for securing web documents and outline possible techniques and mechanisms that can be adopted. Additionally, we discuss research trends in the field and we show the relations that exist between web data protection and clustering information over the web.

Elena Ferrari

Web Content Clustering

Clustering Structured Web Sources: A Schema-Based, Model-Differentiation Approach

The Web has been rapidly “deepened” with the prevalence of databases online. On this “deep Web,” numerous sources are


, providing schema-rich data. Their schemas define the

object domain

and its

query capabilities

. This paper proposes clustering sources by their

query schemas

, which is critical for enabling both

source selection


query mediation

, by organizing sources of with similar query capabilities. In abstraction, this problem is essentially clustering categorical data (by viewing each query schema as a transaction). Our approach hypothesizes that “homogeneous sources” are characterized by the same hidden generative models for their schemas. To find clusters governed by such statistical distributions, we propose a novel objective function,


, which employs principled hypothesis testing to maximize statistical heterogeneity among clusters. Our evaluation shows that, on clustering the Web query schemas, the model-differentiation function outperforms existing ones with the hierarchical agglomerative clustering algorithm.

Bin He, Tao Tao, Kevin Chen-Chuan Chang

Clustering XML Documents Using Structural Summaries

This work presents a methodology for grouping structurally similar XML documents using clustering algorithms. Modeling XML documents with tree-like structures, we face the ‘clustering XML documents by structure’ problem as a ‘tree clustering’ problem, exploiting distances that estimate the similarity between those trees in terms of the hierarchical relationships of their nodes. We suggest the usage of tree structural summaries to improve the performance of the distance calculation and at the same time to maintain or even improve its quality. Experimental results are provided using a prototype testbed.

Theodore Dalamagas, Tao Cheng, Klaas-Jan Winkel, Timos Sellis

A Scalable Randomized Method to Compute Link-Based Similarity Rank on the Web Graph

Several iterative hyperlink-based similarity measures were published to express the similarity of web pages. However, it usually seems hopeless to evaluate complex similarity functions over large repositories containing hundreds of millions of pages.We introduce scalable algorithms computing SimRank scores, which express the contextual similarities of pages based on the hyperlink structure. The proposed methods scale well to large repositories, fulfilling strict requirements about computational complexity. The algorithms were tested on a set of ten million pages, but parallelization techniques make it possible to compute the SimRank scores even for the entire web with over 4 billion pages. The key idea is that randomized Monte Carlo methods combined with indexing techniques yield a scalable approximation of SimRank.

Dániel Fogaras, Balázs Rácz

Web Logs and Selective Clustering

A Framework for Cluster Management

To represent and manipulate data extracted from Web-server logs or applicational logs, clustering techniques can be used. The generated clusters are often different in types, are generated by using different algorithms, and should be homogeneously manipulated together with other knowledge mined from data, for example association rules or decision trees. The problem thus arises of using an homogeneous framework in which cluster results can be represented and manipulated, possibly together with other data mining results. In this paper, we show how the pattern modeling framework presented in [5,10] can be used to this purpose.

Barbara Catania, Anna Maddalena

Diχeminator: A Profile-Based Selective Dissemination System for XML Documents

Current approaches for the selective dissemination of XML documents are not suitable for an automatic adaptation of user profiles to her current preferences because either they rely on user preferences specified by filling up forms or they require to process a high number of documents. In this paper we present the architecture of Di


eminator, a selective dissemination system for XML documents based on profiles. Profiles, represented through XML Schema, concisely represent the kind of documents a user subscribing the service is interested in. Profiles are used for filtering out irrelevant documents relying on user preferences. Moreover, profiles are kept up to date taking into account the documents the user effectively accesses or refuses.

Elisa Bertino, Giovanna Guerrini, Marco Mesiti

Query Recommendation Using Query Logs in Search Engines

In this paper we propose a method that, given a query submitted to a search engine, suggests a list of related queries. The related queries are based in previously issued queries, and can be issued by the user to the search engine to tune or redirect the search process. The method proposed is based on a query clustering process in which groups of semantically similar queries are identified. The clustering process uses the content of historical preferences of users registered in the query log of the search engine. The method not only discovers the related queries, but also ranks them according to a relevance criterion. Finally, we show with experiments over the query log of a search engine the effectiveness of the method.

Ricardo Baeza-Yates, Carlos Hurtado, Marcelo Mendoza


An Overview of Web Data Clustering Practices

Clustering is a challenging topic in the area of Web data management. Various forms of clustering are required in a wide range of applications, including finding mirrored Web pages, detecting copyright violations, and reporting search results in a structured way. Clustering can either be performed once offline, (independently to search queries), or online (on the results of search queries). Important efforts have focused on mining Web access logs and to cluster search engine results on the fly. Online methods based on link structure and text have been applied successfully to finding pages on related topics. This paper presents an overview of the most popular methodologies and implementations in terms of clustering either Web users or Web sources and presents a survey about current status and future trends in clustering employed over the Web.

Athena Vakali, Jaroslav Pokorný, Theodore Dalamagas


Additional information

Premium Partner

    Image Credits