
2009 | Book

Database and Expert Systems Applications

20th International Conference, DEXA 2009, Linz, Austria, August 31 – September 4, 2009. Proceedings

Edited by: Sourav S. Bhowmick, Josef Küng, Roland Wagner

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 20th International Conference on Database and Expert Systems Applications, DEXA 2009, held in Linz, Austria, in August/September 2009. The 35 revised full papers and 35 short papers presented were carefully reviewed and selected from 202 submissions. The papers are organized in topical sections on XML and databases; Web, semantics and ontologies; temporal, spatial, and high dimensional databases; database and information system architecture, performance and security; query processing and optimisation; data and information integration and quality; data and information streams; data mining algorithms; data and information modelling; information retrieval and database systems; and database and information system architecture and performance.

Table of Contents

Frontmatter

Invited Talk

Management of Information Supporting Collaborative Networks

Dynamic creation of opportunity-based, goal-oriented Collaborative Networks (CNs), among organizations or individuals, requires the availability of a variety of up-to-date information. In order to effectively address the complexity, dynamism, and scalability of actors, domains, and operations in opportunity-based CNs, the pre-establishment of properly administered strategic CNs is required. Namely, to effectively support the creation/operation of opportunity-based VOs (Virtual Organizations) operating in a certain domain, the pre-establishment of a VBE (Virtual organizations Breeding Environment) for that domain plays a crucial role and increases their chances of success. Administration of strategic CN environments, however, is challenging and requires an advanced set of inter-related functionalities, developed on top of strong management of their information. With an emphasis on information management aspects, a number of generic challenges for CNs, and especially for the administration of VBEs, are introduced in the paper.

Hamideh Afsarmanesh, Luis M. Camarinha-Matos

XML and Databases I

A Low-Storage-Consumption XML Labeling Method for Efficient Structural Information Extraction

Recently, labeling methods to extract and reconstruct the structural information of XML data, which is important for many applications such as XPath querying and keyword search, have become more attractive. To achieve efficient structural information extraction, in this paper we propose the C-DO-VLEI code, a novel update-friendly bit-vector encoding scheme based on register-length bit operations combined with the properties of Dewey Order numbers, which cannot be implemented in other relevant existing schemes such as ORDPATH. The proposed method also achieves lower storage consumption because it requires neither a prefix schema nor any reserved codes for node insertion. We performed experiments to evaluate and compare the performance and storage consumption of the proposed method with those of the ORDPATH method. Experimental results show that the execution times for extracting depth information and parent node labels using the C-DO-VLEI code are about 25% and 15% lower, respectively, and the average label size is about 24% smaller, compared with ORDPATH.
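The abstract above relies on Dewey Order numbers, from which structural information such as a node's depth and its parent's label can be read off directly. As a rough illustration of that general idea only (not of the C-DO-VLEI bit-vector encoding itself), a dot-separated Dewey label exposes depth and ancestry as follows; the label format and helper names are hypothetical:

```python
# Minimal sketch of how Dewey-order labels expose structural information.
# This illustrates the general idea only, not the C-DO-VLEI bit encoding.

def depth(label: str) -> int:
    """Depth of a node is the number of components in its Dewey label."""
    return len(label.split("."))

def parent(label: str) -> str:
    """The parent's label is obtained by dropping the last component."""
    return ".".join(label.split(".")[:-1])

def is_ancestor(a: str, b: str) -> bool:
    """a is an ancestor of b iff a's components are a proper prefix of b's."""
    pa, pb = a.split("."), b.split(".")
    return len(pa) < len(pb) and pb[:len(pa)] == pa

# Example: node "1.3.2" is at depth 3, its parent is "1.3",
# and "1" is one of its ancestors.
assert depth("1.3.2") == 3
assert parent("1.3.2") == "1.3"
assert is_ancestor("1", "1.3.2")
```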

Wenxin Liang, Akihiro Takahashi, Haruo Yokota
Inclusion Dependencies in XML: Extending Relational Semantics

In this article we define a new type of integrity constraint in XML, called an XML inclusion constraint (XIND), and show that it extends the semantics of a relational inclusion dependency. This property is important in areas such as XML publishing and ‘data-centric’ XML, and is one that is not possessed by other proposals for XML inclusion constraints. We also investigate the implication and consistency problems for XINDs in complete XML documents, a class of XML documents that generalizes the notion of a complete relation, and present an axiom system that we show to be sound and complete.

Michael Karlinger, Millist Vincent, Michael Schrefl
The Real Performance Drivers behind XML Lock Protocols

Fine-grained lock protocols should allow for highly concurrent transaction processing on XML document trees; this is addressed by the taDOM lock protocol family, which enables specific lock modes and lock granules adjusted to the various XML processing models. We have already proved its operational flexibility and performance superiority compared to competitor protocols. Here, we outline the experiences gained during the implementation and optimization of these protocols. We identify their performance drivers to maximize throughput while keeping response times at an acceptable level and fully exploiting the advantages of our tailor-made lock protocols for XML trees. Because we have implemented all options and alternatives in our prototype system XTC, benchmark runs for all “drivers” allow for comparisons in identical environments and illustrate the benefit of all implementation decisions. Finally, they reveal that careful lock protocol optimization pays off.

Sebastian Bächle, Theo Härder

XML and Databases II

XTaGe: A Flexible Generation System for Complex XML Collections

We introduce XTaGe (XML Tester and Generator), a system for the synthesis of XML collections meant for testing and micro-benchmarking applications. In contrast with existing approaches, XTaGe focuses on complex collections, by providing a highly extensible framework to introduce controlled variability in XML structures. In this paper we present the theoretical foundation, internal architecture and main features of our generator; we describe its implementation, which includes a GUI to facilitate the specification of collections; we discuss how XTaGe’s features compare with those in other XML generation systems; finally, we illustrate its usage by presenting a use case in the Bioinformatics domain.

María Pérez, Ismael Sanz, Rafael Berlanga
Utilizing XML Clustering for Efficient XML Data Management on P2P Networks

Peer-to-Peer (P2P) data integration combines the P2P infrastructure with traditional schema-based data integration techniques. Some of the primary problems in this research area concern the techniques used for querying, indexing and distributing documents among peers in a network, especially when the documents are in XML format. To address this problem, we describe an XML P2P system that efficiently distributes a set of clustered XML documents in a P2P network in order to speed up user queries. The novelty of the proposed system lies in the efficient distribution of the XML documents and the construction of an appropriate virtual index on top of the network peers.

Panagiotis Antonellis, Christos Makris, Nikos Tsirakis
On the Termination Problem for Declarative XML Message Processing

We define a formal syntax and semantics for the Rule Definition Language (RDL) of DemaqLite, which is a fragment of the declarative XML message processing system Demaq. Based on this definition, we prove that the termination problem for any practically useful sublanguage of DemaqLite RDL is undecidable, as any such language can emulate a Single Register Machine—a Turing-complete model of computation proposed by Shepherdson and Sturgis.

Tadeusz Litak, Sven Helmer

Web, Semantics and Ontologies I

Consistency Checking for Workflows with an Ontology-Based Data Perspective

Static analysis techniques for consistency checking of workflows make it possible to avoid runtime errors. This is particularly crucial for long-running workflows, where errors detected late can cause high costs. In many classes of workflows, the data perspective is rather simple, and the control flow perspective is the focus of consistency checking. In our setting, however, workflows are used to collect and integrate complex data based on a given domain ontology. In such scenarios, the data perspective becomes central and data consistency checking crucial.

In this paper, we motivate and sketch a simple workflow language with an ontology-based data perspective (SWOD), explain its semantics, classify possible inconsistencies, and present an algorithm for detecting such inconsistencies utilizing semantic web reasoning. We discuss soundness and completeness of the technique.

Gabriele Weiler, Arnd Poetzsch-Heffter, Stephan Kiefer
Conceptual and Spatial Footprints for Complex Systems Analysis: Application to the Semantic Web

This paper advocates the use of Formal Concept Analysis and Galois lattices for complex systems analysis. This method provides an overview of a system by indicating its main areas of interest as well as its level of specificity/generality. Moreover, it proposes possible entry points for navigation by identifying the most significant elements of the system. Automatic filtering of outliers is also provided.

This methodology is generic and may be used for any type of complex system. In this paper, it is applied to the Topic Map formalism, which can be used in the context of the Semantic Web to describe any kind of data as well as ontologies. The proposed Conceptual and Spatial Footprints allow the comparison of Topic Maps both in terms of content and structure. Significant concepts and relationships can be identified, as well as outliers; this method can be used to compare the underlying ontologies or datasets, as illustrated in an experiment.

Bénédicte Le Grand, Michel Soto, Marie-Aude Aufaure
Automatic Extraction of Ontologies Wrapping Relational Data Sources

Describing relational data sources (i.e. databases) by means of ontologies constitutes the foundation of most semantics-based approaches to data access and integration. In spite of the importance of this task, it is mostly carried out manually and, to the best of our knowledge, not much research has been devoted to its automation. In this paper we introduce an automatic procedure for building ontologies starting from the integrity constraints present in the relational sources.

Our work builds upon the wide literature on database schema reverse engineering; however, we adapt these techniques to the specific purpose of reusing the extracted schemata (or ontologies) in the context of semantic data access. In particular, we ensure that the underlying data sources can be queried through the ontologies and the extracted ontologies can be used for semantic integration using recently developed techniques in this area.

In order to represent the extracted ontology we adopt a variant of the DLR-Lite description logic because of its ability to express the most commonly used modelling constraints and its good computational properties. The connection with the relational data sources is captured by means of sound views. Moreover, the adoption of this formal language enables us to prove that the extracted ontologies preserve the semantics of the integrity constraints in the relational sources. Therefore, there is no data loss, and the extracted ontology constitutes a faithful wrapper of the relational sources.

Lina Lubyte, Sergio Tessaris

Temporal, Spatial, and High Dimensional Databases (Short Papers)

A Query Cache Tool for Optimizing Repeatable and Parallel OLAP Queries

On-line analytical processing against data warehouse databases is a common way of obtaining decision-making information in almost every business field. Decision support information often concerns periodic values based on regular attributes, such as sales amounts, percentages, or the most frequently transacted items. This means that many similar OLAP instructions are periodically repeated, and executed simultaneously, by several decision makers. Our Query Cache Tool takes advantage of previously executed queries by storing their results together with the state of the data that was accessed. Future queries only need to execute against the new data inserted since the queries were last executed, and join these results with the previous ones. This makes query execution much faster, because only the most recent data needs to be processed. Our tool also minimizes the execution time and resource consumption of similar queries executed simultaneously by different users, putting the most recent ones on hold until the first finishes and then returning the results to all of them. The stored query results are held until they are considered outdated and are then automatically erased. We present an experimental evaluation of our tool using a data warehouse based on a real-world business dataset and a set of typical decision support queries, and discuss the results, which show a very large gain in query execution time.
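The delta-merging idea described above (answer a repeated query by combining its cached result with an aggregate computed only over rows loaded since the last execution) can be sketched roughly as follows; the class, column names and aggregation are illustrative placeholders, not the tool's actual interface:

```python
# Hedged sketch of a result cache that merges stored results with a delta
# computed only over newly inserted rows. Names are illustrative only.
from collections import defaultdict

class QueryCache:
    def __init__(self):
        # query key -> (cached per-group sums, latest load timestamp seen)
        self.entries = {}

    def run(self, key, rows, group_col, measure_col, ts_col):
        sums, last_ts = self.entries.get(key, (defaultdict(float), None))
        new_ts = last_ts
        for row in rows:
            # Only rows loaded after the cached state need to be aggregated.
            if last_ts is None or row[ts_col] > last_ts:
                sums[row[group_col]] += row[measure_col]
                if new_ts is None or row[ts_col] > new_ts:
                    new_ts = row[ts_col]
        self.entries[key] = (sums, new_ts)
        return dict(sums)
```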

Ricardo Jorge Santos, Jorge Bernardino
Efficient Map Portrayal Using a General-Purpose Query Language
(A Case Study)

Fast image generation from vector or raster data for map navigation by Web clients is an important geo Web application today. Raster data obviously account for the larger volume of the underlying data sets served through WMS and other such interfaces. Dedicated server implementations prevail because an often heard argument is that general-purpose server software, such as database systems, cannot be efficient enough for such high-volume application scenarios.

In this paper we refute that claim. We investigate just-in-time compilation of query fragments in two variants, for CPU and GPU, as implemented in the general-purpose raster DBMS rasdaman. Results suggest that array databases are suitable for real-time geo raster services.

Peter Baumann, Constantin Jucovschi, Sorin Stancu-Mara
On Low Distortion Embeddings of Statistical Distance Measures into Low Dimensional Spaces

In this paper, we investigate various statistical distance measures from the point of view of discovering low distortion embeddings into low dimensional spaces. More specifically, we consider the Mahalanobis distance measure, the Bhattacharyya class of divergences and the Kullback-Leibler divergence. We present a dimensionality reduction method based on the Johnson-Lindenstrauss Lemma for the Mahalanobis measure that achieves arbitrarily low distortion. By using the Johnson-Lindenstrauss Lemma again, we further demonstrate that the Bhattacharyya distance admits dimensionality reduction with arbitrarily low additive error. We also examine the question of embeddability into metric spaces for these distance measures due to the availability of efficient indexing schemes on metric spaces. We provide explicit constructions of point sets under the Bhattacharyya and the Kullback-Leibler divergences whose embeddings into any metric space incur arbitrarily large distortions. To the best of our knowledge, this is the first investigation into these distance measures from the point of view of dimensionality reduction and embeddability into metric spaces.
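One standard way to combine the two ingredients mentioned in the abstract (whitening, which turns the Mahalanobis distance into a Euclidean one, and a Johnson-Lindenstrauss random projection) is sketched below; this is a generic illustration under those assumptions, not the authors' specific construction:

```python
# Sketch: Mahalanobis distance as Euclidean distance after whitening,
# then dimensionality reduction via a Gaussian random projection (JL lemma).
import numpy as np

def whiten(X, cov):
    # Whitening transform such that Euclidean distances between transformed
    # rows equal Mahalanobis distances under covariance `cov`.
    L = np.linalg.cholesky(np.linalg.inv(cov))
    return X @ L

def jl_project(X, k, rng=np.random.default_rng(0)):
    # Gaussian random projection to k dimensions; pairwise Euclidean
    # distances are preserved up to (1 +/- eps) with high probability.
    d = X.shape[1]
    R = rng.standard_normal((d, k)) / np.sqrt(k)
    return X @ R

# Usage: distances in the projected space approximate Mahalanobis distances.
d, n, k = 50, 200, 20
rng = np.random.default_rng(1)
X = rng.standard_normal((n, d))
cov = np.cov(rng.standard_normal((1000, d)), rowvar=False)
Y = jl_project(whiten(X, cov), k)
```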

Arnab Bhattacharya, Purushottam Kar, Manjish Pal
Real-Time Traffic Flow Statistical Analysis Based on Network-Constrained Moving Object Trajectories

In this paper, we propose a novel traffic flow analysis method, the Network-constrained Moving Objects Database based Traffic Flow Statistical Analysis (NMOD-TFSA) model. By sampling and analyzing the spatio-temporal trajectories of network-constrained moving objects, NMOD-TFSA can obtain real-time traffic conditions for the transportation network. The experimental results show that, compared with the floating-car methods widely used in current traffic flow analysis systems, NMOD-TFSA provides improved performance in terms of communication costs and statistical accuracy.

Zhiming Ding, Guangyan Huang

Invited Talk

Data Management for Federated Biobanks

Biobanks store and manage collections of biological material (tissue, blood, cell cultures, etc.) and manage the medical and biological data associated with this material. Biobanks are invaluable resources for medical research. The diversity, heterogeneity and volatility of the domain make information systems for biobanks a challenging application domain. The European project BBMRI (Biobanking and Biomolecular Resources Research Infrastructure) has the mission to network European biobanks, to improve resources for biomedical research, and thus to contribute to improving the prevention, diagnosis and treatment of diseases. We present the challenges and discuss some architectures for interconnecting European biobanks and harmonizing their data.

Johann Eder, Claus Dabringer, Michaela Schicho, Konrad Stark

Web, Semantics and Ontologies II

Peer-to-Peer Semantic Wikis

Wikis have demonstrated how it is possible to convert a community of strangers into a community of collaborators. Semantic wikis have opened an interesting way to mix Web 2.0 advantages with the Semantic Web approach. P2P wikis have illustrated how wikis can be deployed on P2P networks and take advantage of their intrinsic qualities: fault tolerance, scalability and infrastructure cost sharing. In this paper, we present the first P2P semantic wiki that combines the advantages of semantic wikis and P2P wikis. Building a P2P semantic wiki is challenging: it requires building an optimistic replication algorithm that is compatible with P2P constraints, ensures an acceptable level of consistency, and is generic enough to handle semantic wiki pages. The contribution of this paper is the definition of a clear model for building P2P semantic wikis. We define the data model, operations on this model, the intentions of these operations, and algorithms to ensure consistency, and finally we implement the SWOOKI prototype based on these algorithms.

Hala Skaf-Molli, Charbel Rahhal, Pascal Molli
VisiNav: Visual Web Data Search and Navigation

Semantic Web technologies facilitate data integration over a large number of sources with decentralised and loose coordination, ideally leading to interlinked datasets which describe objects, their attributes and links to other objects. Such information spaces are amenable to queries that go beyond traditional keyword search over documents. To this end, we present a formal query model comprising six atomic operations over object-structured datasets: keyword search, object navigation, facet selection, path traversal, projection, and sorting. Using these atomic operations, users can incrementally assemble complex queries that yield a set of objects or trees of objects as result. Results can then be either directly displayed or exported to application programs or online services. We report on user experiments carried out during the design phase of the system, and present performance results for a range of queries over 18.5m statements aggregated from 70k sources.

Andreas Harth
Diagnosing and Measuring Incompatibilities between Pairs of Services

This paper presents a technique which detects all behavioural incompatibilities between two service interfaces (a client and a provider). This may happen because the provider has evolved and its interface has been modified. It may also happen because the client decided to change to another provider which addresses the same needs but offers a different interface. Unlike prior work, the proposed solution does not simply check whether two services are incompatible or not; rather, it provides a detailed diagnosis, including the incompatibilities and, for each one, the location in the service interfaces where it occurs. A measure of similarity between interfaces which takes the output of the detection algorithm into account is also proposed.

Ali Aït-Bachir, Marie-Christine Fauvet

Database and Information System Architecture, Performance and Security (Short Papers)

Scaling-Up and Speeding-Up Video Analytics Inside Database Engine

Most conventional video processing platforms treat the database merely as a storage engine rather than a computation engine, which causes inefficient data access and a massive amount of data movement. Motivated by providing a convergent platform, we push video processing down into the database engine using User Defined Functions (UDFs).

However, the existing UDF technology suffers from two major limitations. First, a UDF cannot take a set of tuples as input or as output, which restricts the modeling capability for complex applications, and the tuple-wise pipelined UDF execution often leads to inefficiency and rules out the potential for enabling data-parallel computation inside the function. Second, UDFs coded in a non-SQL language such as C either involve hard-to-follow DBMS-internal system calls for interacting with the query executor, or sacrifice performance by converting input objects to strings.

To solve the above problems, we realized the notion of Relation Valued Function (RVF) in an industry-scale database engine. With tuple-set input and output, an RVF can have enhanced modeling power, efficiency and in-function data-parallel computation potential. To have RVF execution interact with the query engine efficiently, we introduced the notion of RVF invocation patterns and, based on that, developed RVF containers for focused system support.

We have prototyped these mechanisms on the Postgres database engine, and tested their power with Support Vector Machine (SVM) classification and learning, the most widely used analytics model for video understanding. Our experience reveals the value of the proposed approach in multiple dimensions: modeling capability, efficiency, in-function data-parallelism with multi-core CPUs, as well as usability; all these are fundamental to converging data-intensive analytics and data management.

Qiming Chen, Meichun Hsu, Rui Liu, Weihong Wang
Experimental Evaluation of Processing Time for the Synchronization of XML-Based Business Objects

Business objects (BOs) are data containers for complex data structures used in business applications such as Supply Chain Management and Customer Relationship Management. Due to the replication of application logic, multiple copies of BOs are created which have to be synchronized and updated. This is a complex and time-consuming task because BOs vary greatly in their structure according to the distribution, number and size of elements. Since BOs are internally represented as XML documents, the parsing of XML is one major cost factor which has to be considered for minimizing the processing time during synchronization. The prediction of the parsing time for BOs is a significant property for the selection of an efficient synchronization mechanism. In this paper, we present a method to evaluate the influence of the structure of BOs on their parsing time. The results of our experimental evaluation, incorporating four different XML parsers, examine the dependencies between the distribution of elements and the parsing time. Finally, a general cost model is validated and simplified according to the results of the experimental setup.

Michael Ameling, Bernhard Wolf, Thomas Springer, Alexander Schill
SimulPh.D.: A Physical Design Simulator Tool

The importance of physical design has been amplified as query optimizers have become more sophisticated to cope with complex decision support applications. During the physical design phase, the database designer (DBD) has to select optimization techniques to improve query performance and to manage the different resources assigned to his/her databases. The decision to use these optimization techniques is taken either before or after creating the database schema. Once this decision is taken, the DBD has to perform three main tasks: (i) choosing one or several optimization techniques, (ii) managing interdependencies among the chosen techniques and (iii) choosing a selection algorithm for each technique. Faced with these crucial choices, the development of simulators intended to improve the quality of the physical design represents a challenging issue. In this paper, we propose a simulator, called SimulPh.D, that assists the DBD in making these different choices through user-friendly graphical interfaces.

Ladjel Bellatreche, Kamel Boukhalfa, Zaia Alimazighi
Protecting Database Centric Web Services against SQL/XPath Injection Attacks

Web services represent a powerful interface for back-end database systems and are increasingly being used in business-critical applications. However, field studies show that a large number of web services are deployed with security flaws (e.g., having SQL Injection vulnerabilities). Although several techniques for the identification of security vulnerabilities have been proposed, developing non-vulnerable web services is still a difficult task. In fact, security-related concerns are hard to apply as they involve adding complexity to already complex code. This paper proposes an approach to secure web services against SQL and XPath Injection attacks by transparently detecting and aborting service invocations that try to take advantage of potential vulnerabilities. Our mechanism was applied to secure several web services specified by the TPC-App benchmark, proving to be 100% effective in stopping attacks, non-intrusive and very easy to use.
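One transparent way to realize this kind of protection (not necessarily the exact mechanism used in the paper) is to learn the structure of legitimate SQL commands for each service operation and abort any runtime command whose structure deviates, since injected input typically adds clauses. A minimal, hypothetical sketch:

```python
import re

# Hedged sketch: reduce a SQL command to its "structure" by masking literals,
# then compare runtime commands against structures learned from valid traffic.

def structure(sql: str) -> str:
    s = re.sub(r"'[^']*'", "?", sql)        # mask string literals
    s = re.sub(r"\b\d+(\.\d+)?\b", "?", s)  # mask numeric literals
    return re.sub(r"\s+", " ", s).strip().lower()

learned = {structure("SELECT * FROM users WHERE id = 42")}

def check(sql: str) -> bool:
    """Return True if the command matches a learned structure, else abort it."""
    return structure(sql) in learned

assert check("SELECT * FROM users WHERE id = 7")
# An injected condition changes the command structure and is rejected.
assert not check("SELECT * FROM users WHERE id = 7 OR 1=1")
```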

Nuno Laranjeiro, Marco Vieira, Henrique Madeira
Reasoning on Weighted Delegatable Authorizations

This paper studies logic based methods for representing and evaluating complex access control policies needed by modern database applications. In our framework, authorization and delegation rules are specified in a Weighted Delegatable Authorization Program (WDAP) which is an extended logic program. We show how extended logic programs can be used to specify complex security policies which support weighted administrative privilege delegation, weighted positive and negative authorizations, and weighted authorization propagations. We also propose a conflict resolution method that enables flexible delegation control by considering priorities of authorization grantors and weights of authorizations. A number of rules are provided to achieve delegation depth control, conflict resolution, and authorization and delegation propagations.

Chun Ruan, Vijay Varadharajan

Web, Semantics and Ontologies III

Annotating Atomic Components of Papers in Digital Libraries: The Semantic and Social Web Heading towards a Living Document Supporting eSciences

Rather than a document that is being constantly re-written as in the wiki approach, the Living Document (LD) is one that acts as a document router, operating by means of structured and organized social tagging and existing ontologies. It offers an environment where users can manage papers and related information, share their knowledge with their peers and discover hidden associations among the shared knowledge. The LD builds upon both the Semantic Web, which values the integration of well-structured data, and the Social Web, which aims to facilitate interaction amongst people by means of user-generated content. In this vein, the LD is similar to a social networking system, with users as central nodes in the network, with the difference that interaction is focused on papers rather than people. Papers, with their ability to represent research interests, expertise, affiliations, and links to web based tools and databanks, represent a central axis for interaction amongst users. To begin to show the potential of this vision, we have implemented a novel web prototype that enables researchers to accomplish three activities central to the Semantic Web vision: organizing, sharing and discovering. Availability: http://www.scientifik.info/

Alexander García Castro, Leyla Jael García-Castro, Alberto Labarga, Olga Giraldo, César Montaña, Kieran O’Neil, John A. Bateman
Web Navigation Sequences Automation in Modern Websites

Most of today’s web sources are designed to be used by humans, but they do not provide suitable interfaces for software programs. That is why a growing interest has arisen in so-called web automation applications, which are widely used for different purposes such as B2B integration, automated testing of web applications, or technology and business watch. Previous proposals assume models for generating and reproducing navigation sequences that are not able to correctly deal with new websites using technologies such as AJAX: on the one hand, existing systems only allow recording simple navigation actions and, on the other hand, they are unable to detect the end of the effects caused by a user action. In this paper, we propose a set of new techniques to record and execute web navigation sequences that are able to deal with all the complexity of AJAX-based websites. We also present an exhaustive evaluation of the proposed techniques that shows very promising results.

Paula Montoto, Alberto Pan, Juan Raposo, Fernando Bellas, Javier López
Supporting Personal Semantic Annotations in P2P Semantic Wikis

In this paper, we propose to extend Peer-to-Peer Semantic Wikis with personal semantic annotations. Semantic wikis are one of the most successful Semantic Web applications. In semantic wikis, wiki pages are annotated with semantic data to facilitate navigation, information retrieval and ontology emergence. The semantic data represents the shared knowledge base which describes the common understanding of the community. However, in a collaborative knowledge building process the knowledge is basically created by individuals who are involved in a social process. Therefore, it is fundamental to support personal knowledge building in a differentiated way. Currently there are no available semantic wikis that support both personal and shared understandings. In order to overcome this problem, we propose a P2P collaborative knowledge building process and extend semantic wikis with personal annotation facilities to express personal understanding. In this paper, we detail the personal semantic annotation model and show its implementation in P2P semantic wikis. We also detail an evaluation study which shows that personal annotations demand less cognitive effort than semantic data and are very useful for enriching the shared knowledge base.

Diego Torres, Hala Skaf-Molli, Alicia Díaz, Pascal Molli

XML and Databases III (Short Papers)

OrdPathX: Supporting Two Dimensions of Node Insertion in XML Data

We introduce a novel XML labeling scheme called OrdPathX which supports both leaf and internal node insertions for XML data. Dynamic XML labeling has been studied for years; however, almost all labeling schemes allow only leaf node insertions. Inspired by the careting-in technique of OrdPath [7], we propose a new labeling algorithm which supports internal node insertions gracefully without relabeling. We describe the labeling algorithm and the associated operations for determining various inter-node relationships. Experimental results show that OrdPathX can handle internal node insertions efficiently.

Jing Cai, Chung Keung Poon
XQSuggest: An Interactive XML Keyword Search System

Query suggestion is extensively used in web search engines to suggest relevant queries, which can help users better express their information needs. In this paper, we explore the application of query suggestion in XML keyword search and propose a novel interactive XML query system, XQSuggest, which mainly targets non-professional users who roughly know the contents of the database. Our system extends conventional keyword search systems by instantly suggesting several understandable semantic strings after each keyword is typed in, so that users can easily select their desired semantic string, which represents a specific meaning of the keyword, to replace the ambiguous keyword. We provide a novel algorithm to compute the final results. Experimental results are provided to verify the improved effectiveness of our system.

Jiang Li, Junhu Wang
A Prüfer Based Approach to Process Top-k Queries in XML

Top-k queries in XML involve retrieving approximately matching XML documents. Existing techniques process top-k queries in XML by applying one or more relaxations to the twig query. In this work, we investigate how Prüfer sequences can be utilized to process top-k queries in XML. We design a method called XPRAM that incorporates the relaxations into the sequence matching process. Experimental results indicate that the proposed approach is efficient and scalable.

Ling Li, Mong Li Lee, Wynne Hsu, Han Zhen
Bottom-Up Evaluation of Twig Join Pattern Queries in XML Document Databases

Since the extensible markup language XML emerged as a new standard for information representation and exchange on the Internet, the problem of storing, indexing, and querying XML documents has been among the major issues of database research. In this paper, we study twig pattern matching and discuss a new algorithm for processing ordered twig pattern queries. The time complexity of the algorithm is bounded by O(|D|·|Q| + |T|·leaf_Q) and its space overhead by O(leaf_T·leaf_Q), where T stands for a document tree, Q for a twig pattern, and D is the largest data stream associated with a node q of Q, which contains the database nodes that match the node predicate at q. leaf_T (resp. leaf_Q) denotes the number of leaf nodes of T (resp. Q). In addition, the algorithm can be adapted to an indexing environment with XB-trees being used.

Yangjun Chen
Query Rewriting Rules for Versioned XML Documents

Shared and/or interactive contents such as office documents and wiki contents are often provided with both the latest version and all past versions. It is necessary to add version axes to XPath in order to trace the version histories of fine-grained subdocuments of XML. Although research has been done on the containment and equivalence problems for XPath, which are basic properties for optimizing queries, there has been no research on XPath extended with version axes. In this paper, we propose query rewriting rules which can exchange document axes and version axes, and prove that they preserve query semantics. The rewriting rules enable us to swap path subexpressions between document axes and version axes to optimize queries.

Tetsutaro Motomura, Mizuho Iwaihara, Masatoshi Yoshikawa
Querying XML Data with SPARQL

SPARQL is today the standard access language for Semantic Web data. In recent years XML databases have also acquired industrial importance due to the widespread applicability of XML in the Web. In this paper we present a framework that bridges the heterogeneity gap and creates an interoperable environment where SPARQL queries are used to access XML databases. Our approach assumes that fairly generic mappings between ontology constructs and XML Schema constructs have been automatically derived or manually specified. The mappings are used to automatically translate SPARQL queries to semantically equivalent XQuery queries which are used to access the XML databases. We present the algorithms and the implementation of the SPARQL2XQuery framework, which is used for answering SPARQL queries over XML databases.

Nikos Bikakis, Nektarios Gioldasis, Chrisa Tsinaraki, Stavros Christodoulakis

Query Processing and Optimization I

Progressive Evaluation of XML Queries for Online Aggregation and Progress Indicator

With the rapid proliferation of XML data, large-scale online applications are emerging. In this research, we aim to enhance the XML query processors with the ability to process queries progressively and report partial results and query progress continually. The methodology lays its foundation on sampling. We shed light on how effective samples can be drawn from semi-structured XML data, as opposed to flat-table relational data. Several innovative sampling schemes on XML data are designed. The proposed methodology advances XML query processing to the next level - being more flexible, responsive, user-informed, and user-controllable, to meet emerging needs and future challenges.

Cheng Luo, Zhewei Jiang, Wen-Chi Hou, Gultekin Ozsoyoglu
Dynamic Query Processing for P2P Data Services in the Cloud

With the trend of cloud computing, data and computing are moved away from the desktop and are instead provided as a service from the cloud. Data-as-a-service enables access to a wealth of data across distributed and heterogeneous data sources in the cloud. We designed and developed DObjects, a general-purpose P2P-based query and data operations infrastructure that can be deployed in the cloud. This paper presents the details of the dynamic query execution engine within our data query infrastructure that dynamically adapts to network and node conditions. The query processing is capable of fully benefiting from all the distributed resources to minimize the query response time and maximize system throughput. We present a set of experiments using both simulations and a real implementation and deployment.

Pawel Jurczyk, Li Xiong
A Novel Air Index Scheme for Twig Queries in On-Demand XML Data Broadcast

Data broadcast is an efficient way to disseminate information in wireless mobile environments, and on-demand XML data broadcast is one of the most important research issues in this area. Indexing XML data on the wireless channel is critical for this issue since energy management is very important in wireless mobile environments. Previous work has focused on air index schemes for single path queries. In this paper, we propose a novel air index scheme that builds concise air indexes for twig queries in on-demand XML data broadcast. We adopt the Document Tree structure as the basic air index structure for twig queries and propose to prune redundant structures of the basic Document Tree indexes to reduce energy consumption. We then propose to combine all the pruned indexes into one, which eliminates structural redundancy among the indexes to further reduce energy consumption. Our preliminary experiments show that our air index scheme is very effective and efficient, as it builds concise air indexes and supports twig queries without losing any precision.

Yongrui Qin, Weiwei Sun, Zhuoyao Zhang, Ping Yu, Zhenying He, Weiyu Chen

Semantic Web and Ontologies IV (Short Papers)

Semantic Fields: Finding Ontology Relationships

This paper presents the Semantic Field concept which enables the global comparison of ontologies. This concept uses the results obtained from ontology matching tools to estimate the global similarity between ontologies. It has been used in a tool called the Semantic Field Tool (SemFiT). A demo tool is provided (http://khaos.uma.es/SFD).

Ismael Navas-Delgado, Maria del Mar Roldán-García, José F. Aldana-Montes
Complete OWL-DL Reasoning Using Relational Databases

Real Semantic Web applications, such as biological tools, use large ontologies, that is, ontologies with a large number (millions) of instances. Due to the increasing development of such applications, it is necessary to provide scalable and efficient ontology querying and reasoning systems. DBOWL is a Persistent and Scalable OWL reasoner which stores ontologies and implements reasoning using a relational database. In this paper we present an extension of DBOWL that implements all inference rules for OWL-DL. Furthermore, we describe briefly the reasoning algorithms and their completeness proofs.

Maria del Mar Roldan-Garcia, Jose F. Aldana-Montes
FRESG: A Kind of Fuzzy Description Logic Reasoner

Based on the fuzzy description logic F-ALC(G), we design and implement a fuzzy description logic reasoner, named FRESG1.0. FRESG1.0 supports the representation of and reasoning over fuzzy data with customized fuzzy data types and customized fuzzy data type predicates. We briefly introduce the reasoning services provided by FRESG1.0. Then, we detail the overall architecture of FRESG1.0 and the design and implementation of its major components. In the paper, we pay particular attention to illustrating the features of the reasoner as well as the algorithms and technologies adopted in the implementation.

Hailong Wang, Z. M. Ma, Junfu Yin
Extracting Related Words from Anchor Text Clusters by Focusing on the Page Designer’s Intention

Approaches that extract related words (terms) by co-occurrence sometimes work poorly. Two words frequently co-occurring in the same documents are considered related; however, they may not be related at all, because they may share no common meaning or similar semantics. We address this problem by considering the page designer’s intention and propose a new model to extract related words. Our approach is based on the idea that web page designers usually place correlative hyperlinks close together on the browser. We developed a browser-based crawler to collect “geographically” near hyperlinks, and then, by clustering these hyperlinks based on their pixel coordinates, we extract related words which well reflect the designer’s intention. Experimental results show that our method can represent the intention of the web page designer with extremely high precision. Moreover, the experiments indicate that our extraction method can obtain related words with high average precision.

Jianquan Liu, Hanxiong Chen, Kazutaka Furuse, Nobuo Ohbo

Invited Talk

Evolution of Query Optimization Methods: From Centralized Database Systems to Data Grid Systems

The purpose of this talk is to provide a comprehensive state of the art concerning the evolution of query optimization methods, from centralized database systems to data Grid systems, through parallel, distributed and data integration systems. For each environment, we briefly describe some methods and point out their main characteristics.

Abdelkader Hameurlain

Query Processing and Optimization II

Reaching the Top of the Skyline: An Efficient Indexed Algorithm for Top-k Skyline Queries

Criteria that induce a Skyline naturally represent user preference conditions useful for discarding irrelevant data in large datasets. However, in the presence of high-dimensional Skyline spaces, the size of the Skyline can still be very large, making it unfeasible for users to process this set of points. To identify the best points among the Skyline, the Top-k Skyline approach has been proposed. Top-k Skyline uses discriminatory criteria to induce a total order of the points that comprise the Skyline, and recognizes the best or top-k objects based on these criteria. Different algorithms have been defined to compute the top-k objects among the Skyline; while existing solutions are able to produce the Top-k Skyline, they may be very costly. First, state-of-the-art Top-k Skyline solutions require the computation of the whole Skyline; second, they execute probes of the multicriteria function over all the Skyline points. Thus, if k is much smaller than the cardinality of the Skyline, these solutions may be very inefficient because a large number of unnecessary probes may be evaluated. In this paper, we propose the TKSI, an efficient solution for the Top-k Skyline that overcomes the drawbacks of existing solutions. The TKSI is an index-based algorithm that is able to compute only the subset of the Skyline that is required to produce the top-k objects; thus, the TKSI is able to minimize the number of unnecessary probes. We have empirically studied the quality of the TKSI, and we report initial experimental results that show the TKSI is able to speed up the computation of the Top-k Skyline by at least 50% w.r.t. state-of-the-art solutions when k is smaller than the size of the Skyline.
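For concreteness, a point belongs to the Skyline if no other point dominates it (i.e., is at least as good in every criterion and strictly better in at least one), and the Top-k Skyline then ranks the Skyline points with a discriminatory score. The following naive sketch illustrates these definitions only; unlike the TKSI, it computes the whole Skyline first and assumes lower values are better:

```python
# Naive sketch of Skyline + Top-k Skyline (assumes lower values are better).
# Unlike TKSI, this first computes the entire Skyline and then ranks it.

def dominates(a, b):
    """a dominates b if a is <= b everywhere and < b in at least one criterion."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    return [p for p in points if not any(dominates(q, p) for q in points if q is not p)]

def top_k_skyline(points, k, score):
    """Rank the Skyline points with a discriminatory scoring function."""
    return sorted(skyline(points), key=score)[:k]

points = [(1, 5), (2, 2), (5, 1), (4, 4), (3, 3)]
print(top_k_skyline(points, 2, score=sum))  # [(2, 2), (1, 5)] -> best two points
```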

Marlene Goncalves, María-Esther Vidal
Energy Efficient and Progressive Strategy for Processing Skyline Queries on Air

Computing the skyline and its variations has attracted a lot of attention in the database community; however, processing such queries in wireless broadcast environments remains an unaddressed problem despite the unique benefits of these environments compared to others. In this paper, we propose a strategy to process skyline queries for the possible expansion of current data broadcasting services. For energy-efficient processing of skyline queries, the Sweep space-filling curve is utilized, based on the existing DSI structure, to generate the broadcast program at the server side. The corresponding algorithms for processing skyline queries on mobile clients are also proposed. Moreover, we extend the DSI structure based on a novel concept of Minimized Dominating Points (MDP) in order to provide a progressive algorithm for the queries. We evaluate our strategy by simulation, and the experimental results demonstrate the energy efficiency of the proposed methods.

JongWoo Ha, Yoon Kwon, Jae-Ho Choi, SangKeun Lee
RoK: Roll-Up with the K-Means Clustering Method for Recommending OLAP Queries

Dimension hierarchies represent a substantial part of the data warehouse model. Indeed they allow decision makers to examine data at different levels of detail with On-Line Analytical Processing (OLAP) operators such as drill-down and roll-up. The granularity levels which compose a dimension hierarchy are usually fixed during the design step of the data warehouse, according to the identified analysis needs of the users. However, in practice, the needs of users may evolve and grow in time. Hence, to take into account the users’ analysis evolution into the data warehouse, we propose to integrate personalization techniques within the OLAP process. We propose two kinds of OLAP personalization in the data warehouse: (1) adaptation and (2) recommendation.

Adaptation allows users to express their own needs in terms of aggregation rules defined from a child level (existing level) to a parent level (new level). The system will adapt itself by including the new hierarchy level into the data warehouse schema. For recommending new OLAP queries, we provide a new OLAP operator based on the K-means method. Users are asked to choose K-means parameters following their preferences about the obtained clusters which may form a new granularity level in the considered dimension hierarchy. We use the K-means clustering method in order to highlight aggregates semantically richer than those provided by classical OLAP operators. In both adaptation and recommendation techniques, the new data warehouse schema allows new and more elaborated OLAP queries.

Our approach for OLAP personalization is implemented within Oracle 10g as a prototype which allows the creation of new granularity levels in the dimension hierarchies of the data warehouse. Moreover, we carried out experiments which validate the relevance of our approach.
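As a hedged illustration of the recommendation idea (clustering the members of an existing dimension level on aggregated measures to form a new, coarser level), one might group cities by sales figures and map each city to its cluster; the data, names and use of scikit-learn are assumptions for illustration, not the prototype's actual interface:

```python
# Sketch: derive a new granularity level by clustering members of an existing
# dimension level on aggregated measures (hypothetical data and names).
import numpy as np
from sklearn.cluster import KMeans

cities = ["Lyon", "Paris", "Linz", "Graz", "Vienna"]
# One feature vector per city, e.g. [avg_sales, avg_margin].
features = np.array([[120, 0.21], [340, 0.18], [90, 0.25], [70, 0.24], [260, 0.19]])

k = 2  # chosen by the user, mirroring the user-supplied K-means parameters
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)

# Mapping child level -> new parent level ("city group") that would be
# added to the dimension hierarchy.
new_level = {city: f"group_{g}" for city, g in zip(cities, labels)}
print(new_level)
```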

Fadila Bentayeb, Cécile Favre

Query Processing and Optimization III

On Index-Free Similarity Search in Metric Spaces

Metric access methods (MAMs) serve as a tool for speeding up similarity queries. However, all MAMs developed so far are index-based; they need to build an index on a given database. The indexing itself is either static (the whole database is indexed at once) or dynamic (insertions/deletions are supported), but a preprocessing step is always needed. In this paper, we propose the D-file, the first MAM that requires no indexing at all. This feature is especially beneficial in domains like data mining, streaming databases, etc., where the production of data is much more intensive than querying, so that indexing becomes the bottleneck of the entire production/querying scheme. The idea of the D-file is an extension of the trivial sequential file (actually an abstraction over the original database) by the so-called D-cache. The D-cache is a main-memory structure that keeps track of distance computations spent by processing all similarity queries so far (within a runtime session). Based on the distances stored in the D-cache, the D-file can cheaply determine lower bounds of some distances without having to compute the distances explicitly, which results in faster queries. Our experimental evaluation shows that the query efficiency of the D-file is comparable to that of state-of-the-art index-based MAMs, however, with zero indexing costs.
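The cheap lower bounds mentioned above typically come from the triangle inequality: if d(q, p) and d(p, x) are already known, e.g. cached from earlier queries, then |d(q, p) - d(p, x)| lower-bounds d(q, x), so an object can be discarded without computing d(q, x) whenever that bound exceeds the query radius. The sketch below illustrates this generic filtering principle, not the D-cache data structure itself; the pivot set and cache layout are assumptions:

```python
# Generic sketch of triangle-inequality filtering with cached distances
# (illustrates the principle only, not the D-cache data structure itself).

def range_query(q, radius, objects, dist, pivots, cache):
    """pivots: objects p for which dist(q, p) is known; cache[(p, x)] stores
    previously computed distances between pivots and database objects."""
    d_qp = {p: dist(q, p) for p in pivots}          # a few exact computations
    result = []
    for x in objects:
        # Best lower bound on dist(q, x) derivable from the cached distances.
        lb = max((abs(d_qp[p] - cache[(p, x)])
                  for p in pivots if (p, x) in cache), default=0.0)
        if lb > radius:
            continue                    # discard x without computing dist(q, x)
        d = dist(q, x)
        cache[(q, x)] = d               # q can serve as a pivot for later queries
        if d <= radius:
            result.append(x)
    return result
```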

Tomáš Skopal, Benjamin Bustos
An Approximation Algorithm for Optimizing Multiple Path Tracking Queries over Sensor Data Streams

Sensor networks have received considerable attention in recent years and play an important role in data collection applications. Sensor nodes have a limited supply of energy; therefore, one of the major design considerations for sensor applications is to reduce power consumption. In this paper, we study an application that combines RFID and sensor network technologies to provide an environment for moving-object path tracking, which needs efficient join processing. This paper considers multi-query optimization to reduce the query evaluation cost, and therefore power consumption. We formulate the multi-query optimization problem and present a novel approximation algorithm which provides solutions with suboptimality guarantees. In addition, extensive experiments are conducted to demonstrate the performance of the proposed optimization strategy.

Yao-Chung Fan, Arbee L. P. Chen

Data and Information Integration and Quality

A Versatile Record Linkage Method by Term Matching Model Using CRF

We solve the problem of record linkage between databases where record fields are mixed and permuted in different ways. Our method uses a conditional random fields model to find matching terms in record pairs and uses the matching terms in the duplicate detection process. Although records with permuted fields may have partly reordered terms, our method can still exploit the local order of terms to find matching terms. We carried out experiments on several well-known data sets used in record linkage research, and our method showed its advantages on most of them. We also ran experiments on a synthetic data set, in which records combined fields in random order, and verified that our method can handle even this data set.

Quang Minh Vu, Atsuhiro Takasu, Jun Adachi
On-the-Fly Integration and Ad Hoc Querying of Life Sciences Databases Using LifeDB

Data-intensive applications in the Life Sciences extensively use the hidden web as a platform for information sharing. Access to these heterogeneous hidden web resources is limited to predefined web forms and interactive interfaces that users navigate manually, assuming responsibility for reconciling schema heterogeneity, extracting information, piping and transforming formats, and so on, in order to implement desired query sequences or scientific workflows. In this paper, we present a new data management system, called LifeDB, in which we offer support for currency without view materialization and autonomous reconciliation of schema heterogeneity in one single platform, through a declarative query language called BioFlow. In our approach, schema heterogeneity is resolved at run time by treating the hidden web resources as virtual warehouses, and by supporting a set of primitives for on-the-fly data integration, extracting information and piping it to other resources, and manipulating data in a way similar to traditional database systems in response to application demands.

Anupam Bhattacharjee, Aminul Islam, Mohammad Shafkat Amin, Shahriyar Hossain, Shazzad Hosain, Hasan Jamil, Leonard Lipovich
Analyses and Validation of Conditional Dependencies with Built-in Predicates

This paper proposes a natural extension of conditional functional dependencies (CFDs [14]) and conditional inclusion dependencies (CINDs [8]), denoted by CFD^p s and CIND^p s, respectively, by specifying patterns of data values with ≠, <, ≤, > and ≥ predicates. As data quality rules, CFD^p s and CIND^p s are able to capture errors that commonly arise in practice but cannot be detected by CFDs and CINDs. We establish two sets of results for central technical problems associated with CFD^p s and CIND^p s. (a) One concerns the satisfiability and implication problems for CFD^p s and CIND^p s, taken separately or together. These are important for, e.g., deciding whether data quality rules are dirty themselves, and for removing redundant rules. We show that despite the increased expressive power, the static analyses of CFD^p s and CIND^p s retain the same complexity as their CFD and CIND counterparts. (b) The other concerns validation of CFD^p s and CIND^p s. We show that given a set Σ of CFD^p s and CIND^p s on a database D, a set of SQL queries can be automatically generated that, when evaluated against D, return all tuples in D that violate some dependencies in Σ. This provides commercial DBMSs with an immediate capability to detect errors based on CFD^p s and CIND^p s.
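To give a flavor of point (b), consider a hypothetical CFD^p on a relation cust(country, level, discount) stating that for country = 'UK' and level < 3 the discount must be at most 10. The sketch below shows how a violation-detection query for such a single-tuple pattern could be generated; it is illustrative only, whereas the paper's translation covers the general case, including CIND^p s:

```python
# Hedged sketch: generate a SQL query returning tuples that violate a simple
# CFD^p whose pattern uses built-in predicates (illustrative schema and rule).

def violation_query(table, conditions, consequent):
    """conditions / consequent are lists of (attribute, operator, value)."""
    fmt = lambda a, op, v: f"{a} {op} {v!r}"
    where = " AND ".join(fmt(*c) for c in conditions)
    # A tuple violates the rule if it matches the pattern on the left-hand
    # side but fails the predicate required on the right-hand side.
    negated = " OR ".join(f"NOT ({fmt(*c)})" for c in consequent)
    return f"SELECT * FROM {table} WHERE {where} AND ({negated})"

sql = violation_query(
    "cust",
    conditions=[("country", "=", "UK"), ("level", "<", 3)],
    consequent=[("discount", "<=", 10)],
)
print(sql)
# SELECT * FROM cust WHERE country = 'UK' AND level < 3 AND (NOT (discount <= 10))
```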

Wenguang Chen, Wenfei Fan, Shuai Ma

Data Mining and Knowledge Extraction (Short Papers)

Discovering Sentinel Rules for Business Intelligence

This paper proposes the concept of sentinel rules for multi-dimensional data, which warn users when measure data concerning the external environment changes. For instance, a surge in negative blogging about a company could trigger a sentinel rule warning that revenue will decrease within two months, so that a new course of action can be taken. Hereby, we expand the window of opportunity for organizations and facilitate successful navigation even though the world behaves chaotically. Since sentinel rules are at the schema level as opposed to the data level, and operate on data changes as opposed to absolute data values, we are able to discover strong and useful sentinel rules that would otherwise be hidden when using sequential pattern mining or correlation techniques. We present a method for sentinel rule discovery and an implementation of this method that scales linearly on large data volumes.

Morten Middelfart, Torben Bach Pedersen
Discovering Trends and Relationships among Rules

Data repositories are constantly evolving, and techniques are needed to reveal the dynamic behaviors in the data that might be useful to the user. Existing temporal association rule mining algorithms consider time as just another dimension and do not describe the behavior of rules over time. In this work, we introduce the notion of a trend fragment to facilitate the analysis of relationships among rules. Two algorithms are proposed to find the relationships among rules. Experimental results on both synthetic and real-world datasets indicate that our approach is scalable and effective.

Chaohai Chen, Wynne Hsu, Mong Li Lee
Incremental Ontology-Based Extraction and Alignment in Semi-structured Documents

SHIRI is an ontology-based system for the integration of semi-structured documents related to a specific domain. The system’s purpose is to allow users to access relevant parts of documents as answers to their queries. SHIRI uses RDF/OWL for the representation of resources and SPARQL for querying them. It relies on an automatic, unsupervised and ontology-driven approach for the extraction, alignment and semantic annotation of tagged elements of documents. In this paper, we focus on the Extract-Align algorithm, which exploits a set of named entity and term patterns to extract term candidates to be aligned with the ontology. It proceeds in an incremental manner in order to populate the ontology with terms describing instances of the domain and to reduce access to external resources such as the Web. We evaluate it on an HTML corpus related to calls for papers in computer science, and the results we obtain are very promising. These results show how the incremental behaviour of the Extract-Align algorithm enriches the ontology and increases the number of terms (or named entities) aligned directly with the ontology.

Mouhamadou Thiam, Nacéra Bennacer, Nathalie Pernelle, Moussa Lô
Tags4Tags: Using Tagging to Consolidate Tags

Tagging has become increasingly popular and useful across various social networks and applications. It allows users to classify and organize resources to improve retrieval performance over those tagged resources. Within social networks, tags can also facilitate interaction between members of the community, e.g. because similar tags may represent similar interests. Although obviously useful for straightforward retrieval tasks, the current meta-data model underlying typical tagging systems does not fully exploit the potential of the social process of finding, establishing, challenging, and promoting symbols, i.e. tags. For instance, the social process is not used for establishing an explicit hierarchy of tags or for the collective detection of equivalences, synonyms, morphological variants, and other useful relationships across tags. This limitation is due to the constraints of the typical meta-model of tagging, in which the subject must be a Web resource, the relationship type is always hasTag, and the object must be a tag as a literal. In this paper, we propose a simple yet effective extension of the current meta-model of tagging systems in order to exploit the potential of collective tagging for the emergence of richer semantic structures, in particular for capturing semantic relationships between tags. Our approach expands the range of the object of tagging from Web resources only to the union of (1) Web resources and (2) pairs of tags, i.e., users can now use arbitrary tags to express typed relationships between a pair of tags. This allows the user community to establish similarity relations and other types of relationships between tags. We present a first prototype and the results of an evaluation in a small controlled setting.
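As a rough illustration of the extended meta-model (the target of a tagging act may be a Web resource or a pair of tags), here is a hedged Python sketch; the class and field names are ours, not the paper's.

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Tag:
    label: str

@dataclass(frozen=True)
class Resource:
    uri: str

@dataclass(frozen=True)
class Tagging:
    user: str
    target: Union[Resource, Tuple[Tag, Tag]]  # what is tagged: a resource or a pair of tags
    tag: Tag                                  # the tag applied to the target

# Classic tagging: a Web resource receives a tag.
t1 = Tagging("alice", Resource("http://example.org/doc1"), Tag("nyc"))

# Extended tagging: a pair of tags receives a tag, expressing a typed relationship.
t2 = Tagging("bob", (Tag("nyc"), Tag("new-york-city")), Tag("synonym"))
```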

Leyla Jael Garcia-Castro, Martin Hepp, Alexander Garcia

Data and Information Streams

Detecting Projected Outliers in High-Dimensional Data Streams

In this paper, we study the problem of projected outlier detection in high-dimensional data streams and propose a new technique, called Stream Projected Outlier deTector (SPOT), to identify outliers embedded in subspaces. The Sparse Subspace Template (SST), a set of subspaces obtained by unsupervised and/or supervised learning processes, is constructed in SPOT to detect projected outliers effectively. A Multi-Objective Genetic Algorithm (MOGA) is employed as an effective search method for finding outlying subspaces in training data to construct the SST. The SST is able to carry out online self-evolution in the detection stage to cope with the dynamics of data streams. The experimental results demonstrate the efficiency and effectiveness of SPOT in detecting outliers in high-dimensional data streams.

Ji Zhang, Qigang Gao, Hai Wang, Qing Liu, Kai Xu
Significance-Based Failure and Interference Detection in Data Streams

Detecting the failure of a data stream is relatively easy when the stream is continually full of data. The transfer of large amounts of data allows simple detection of interference, whether accidental or malicious. However, during interference, data transmission can become irregular rather than smooth. When the traffic is intermittent, it is harder to detect that failure has occurred, which may lead to an application at the receiving end requesting retransmission or disconnecting. Requesting retransmission places additional load on a system, and disconnection can lead to unnecessary reversion to a checkpointed database before reconnecting and reissuing the same request or response. In this paper, we model the traffic in data streams as a set of significant events whose arrivals follow a Poisson distribution. Once an arrival rate has been determined, overdue or lost events can be detected with greater reliability. The model also allows the rate parameter to be altered to reflect changes in the system and provides support for multiple levels of data aggregation. One significant benefit of the Poisson-based model is that transmission events can be deliberately manipulated in time to provide a steganographic channel that confirms sender/receiver identity.
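For intuition only (this is not the authors' detection algorithm): under a Poisson arrival model with rate λ, the probability of observing no significant events for a period t is e^(-λt), so an improbably long silence can be flagged as likely failure or interference. The function names and the threshold below are hypothetical.

```python
import math

def silence_probability(lam: float, seconds_since_last_event: float) -> float:
    """Probability of seeing zero events in this window under a Poisson(lam) model."""
    return math.exp(-lam * seconds_since_last_event)

def likely_failed(lam: float, silence: float, alpha: float = 0.001) -> bool:
    """Flag a stream as failed when a silence this long would occur with
    probability below `alpha` under the fitted Poisson model."""
    return silence_probability(lam, silence) < alpha

# Example: ~2 events/second expected, nothing seen for 4 seconds.
print(likely_failed(lam=2.0, silence=4.0))   # exp(-8) ≈ 0.00034 < 0.001 -> True
```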

Nickolas J. G. Falkner, Quan Z. Sheng
Incremental and Adaptive Clustering Stream Data over Sliding Window

Cluster analysis plays a key role in data stream understanding. The problem is difficult when the clustering task is considered in a sliding-window model, in which the requirement to eliminate outdated data must be dealt with properly. We propose the SWEM algorithm, which is designed on the basis of the Expectation Maximization technique to address these challenges. SWEM is able to compute clusters incrementally using a small number of statistics summarized over the stream, and to adapt to changes in the stream's distribution. The feasibility of SWEM has been verified in a number of experiments, and we show that it is superior to the CluStream algorithm on both synthetic and real datasets.
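A minimal sketch, not SWEM itself: per-cluster sufficient statistics (count, linear sum, squared sum) can be updated incrementally as points arrive and subtracted again when an old bucket of statistics slides out of the window. All names are hypothetical.

```python
class ClusterStats:
    """Incremental sufficient statistics for one cluster over a sliding window."""
    def __init__(self, dim):
        self.n, self.ls, self.ss = 0.0, [0.0] * dim, [0.0] * dim

    def add(self, x, weight=1.0):            # a new point (or a fractional E-step weight)
        self.n += weight
        for i, v in enumerate(x):
            self.ls[i] += weight * v
            self.ss[i] += weight * v * v

    def remove(self, other):                  # expire a bucket of statistics that left the window
        self.n -= other.n
        for i in range(len(self.ls)):
            self.ls[i] -= other.ls[i]
            self.ss[i] -= other.ss[i]

    def mean(self):
        return [s / self.n for s in self.ls]

    def variance(self):
        return [ss / self.n - m * m for ss, m in zip(self.ss, self.mean())]
```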

Xuan Hong Dang, Vincent C. S. Lee, Wee Keong Ng, Kok Leong Ong

Data Mining Algorithms

Alignment of Noisy and Uniformly Scaled Time Series

The alignment of noisy and uniformly scaled time series is an important but difficult task. Given two time series, one of which is a uniformly stretched subsequence of the other, we want to determine the stretching factor and the offset of the second time series within the first one. We adapted and enhanced different methods to address this problem: classical FFT-based approaches to determine the offset, combined with a naïve search for the stretching factor or its direct computation in the frequency domain; bounded dynamic time warping; and a new approach called shotgun analysis, which is inspired by the sequencing and reassembly of genomes in bioinformatics. We thoroughly examined the strengths and weaknesses of the different methods on synthetic and real data sets. The FFT-based approaches are very accurate on high-quality data, while the shotgun approach is especially suitable for data with outliers. Dynamic time warping is a candidate for non-linear stretching or compression. We successfully applied the presented methods to identify steel coils via their thickness profiles.
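The sketch below illustrates only the FFT-based offset search via cross-correlation, under the simplifying assumption of no stretching; the stretching-factor estimation, bounded DTW and shotgun analysis discussed above are not shown, and the function name is ours.

```python
import numpy as np

def find_offset(long_series: np.ndarray, short_series: np.ndarray) -> int:
    """Return the offset at which short_series best matches long_series,
    using zero-padded FFT cross-correlation."""
    n = len(long_series) + len(short_series) - 1
    fa = np.fft.rfft(long_series, n)
    fb = np.fft.rfft(short_series, n)
    corr = np.fft.irfft(fa * np.conj(fb), n)
    return int(np.argmax(corr[: len(long_series) - len(short_series) + 1]))

rng = np.random.default_rng(0)
base = rng.normal(size=1000)
segment = base[300:500] + rng.normal(scale=0.05, size=200)   # noisy subsequence
print(find_offset(base, segment))                            # offset near 300
```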

Constanze Lipowsky, Egor Dranischnikow, Herbert Göttler, Thomas Gottron, Mathias Kemeter, Elmar Schömer
Extracting Decision Correlation Rules

In this paper, two concepts are introduced: decision correlation rules and contingency vectors. The first results from a cross-fertilization between correlation rules and decision rules. It enables relevant links to be highlighted between sets of patterns of a binary relation and the values of target items belonging to the same relation, on the twofold basis of the chi-squared measure and the support of the extracted patterns. Due to the very nature of the problem, levelwise algorithms can only extract results at the cost of long execution times and huge memory consumption. To offset these two problems, we propose an algorithm based both on the lectic order and on contingency vectors, an alternative representation of contingency tables.
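For readers unfamiliar with the measure involved, here is a plain chi-squared statistic over a 2x2 contingency table (pattern present/absent vs. target value present/absent); the paper's lectic-order traversal and contingency vectors are not reproduced here, and the example counts are invented.

```python
def chi_squared_2x2(a, b, c, d):
    """Contingency table [[a, b], [c, d]]: rows = pattern present/absent,
    columns = target value present/absent."""
    n = a + b + c + d
    expected = [
        [(a + b) * (a + c) / n, (a + b) * (b + d) / n],
        [(c + d) * (a + c) / n, (c + d) * (b + d) / n],
    ]
    observed = [[a, b], [c, d]]
    return sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
               for i in range(2) for j in range(2))

# Pattern and target value co-occur in 40 of 100 transactions:
print(chi_squared_2x2(40, 10, 5, 45))   # a large value indicates strong correlation
```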

Alain Casali, Christian Ernst
Maintaining the Dominant Representatives on Data Streams

It is well known that a traditional skyline query is very likely to return many but less informative data points, especially when the dataset being queried is high-dimensional or anti-correlated. In data stream applications, where large amounts of data are continuously generated, this problem becomes much more serious, since the full skyline result can neither be obtained efficiently nor analyzed easily. To cope with this difficulty, we propose in this paper a new concept, called the Combinatorial Dominant relationship, to abstract dominant representatives of stream data. Based on this concept, we propose three novel skyline queries, namely the basic convex skyline query (BCSQ), the dynamic convex skyline query (DCSQ), and the reverse convex skyline query (RCSQ), combining for the first time the concept of convexity from geometry with the traditional skyline. These queries can adaptively abstract the contour of skyline points without the size of the result set being specified in advance, and they increase the information content of the query result. To process these queries and maintain their results efficiently, we design and analyze algorithms that exploit an in-memory indexing structure called the DCEL, which represents and stores the arrangement of the data in the sliding window. We convert problems on points in the primal plane into problems on lines in the dual plane through a dual transformation, which helps us avoid expensive full skyline computation and speeds up candidate set selection. Finally, through extensive experiments on both real and synthetic datasets, we validate the representative capability of CSQs as well as the performance of our proposed algorithms.

Wenlin He, Cuiping Li, Hong Chen

Data and Information Modelling

Modeling Complex Relationships

Real-world objects have various natural and complex relationships with each other, and through these relationships objects play various roles that form their context and thus have corresponding context-dependent properties. Existing data models, such as object-oriented models and role models, cannot naturally and directly represent such complex relationships and context-dependent properties. In this paper, we present a method that provides such natural and direct support.

Mengchi Liu, Jie Hu
Intuitive Visualization-Oriented Metamodeling

In this article we present a metamodeling tool that is strictly oriented towards the needs of the working domain expert, who longs for intuitive metamodeling features. In particular, this concerns rich capabilities for specifying the visual appearance of models. In these efforts we have identified an important design rationale for metamodeling tools that we call visual reification: the notion that metamodels are visualized in the same way as their instances. Our tool supports both standard metamodeling features and new metamodeling features oriented towards the visual reification principle. We start an unbiased discussion of the pragmatics of metamodeling tools against the background of this design rationale.

Dirk Draheim, Melanie Himsl, Daniel Jabornig, Werner Leithner, Peter Regner, Thomas Wiesinger

Information Retrieval and Database Systems (Short Papers)

PISA: Federated Search in P2P Networks with Uncooperative Peers

Recently, federated search in P2P networks has received much attention. Most of the previous work assumed a cooperative environment where each peer can actively participate in information publishing and distributed document indexing. However, little work has addressed the problem of incorporating uncooperative peers, which do not publish their own corpus statistics, into a network. This paper presents a P2P-based federated search framework called PISA which incorporates uncooperative peers as well as the normal ones. In order to address the indexing needs for uncooperative peers, we propose a novel heuristic query-based sampling approach which can obtain high-quality resource descriptions from uncooperative peers at relatively low communication cost. We also propose an effective method called RISE to merge the results returned by uncooperative peers. Our experimental results indicate that PISA can provide quality search results, while utilizing the uncooperative peers at a low cost.
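To illustrate the general idea of query-based sampling of an uncooperative peer (not PISA's specific heuristic), here is a hedged Python sketch: probe the peer with terms drawn from documents sampled so far and accumulate approximate term statistics as a resource description. `search_peer` is a hypothetical stand-in for the peer's search interface.

```python
import random
from collections import Counter

def sample_resource_description(search_peer, seed_terms, probes=50, top_n=4):
    """Build an approximate resource description (term counts) for a peer
    that does not publish its corpus statistics."""
    description, sampled_docs = Counter(), []
    vocabulary = list(seed_terms)
    for _ in range(probes):
        query = random.choice(vocabulary)
        for doc in search_peer(query)[:top_n]:
            if doc not in sampled_docs:
                sampled_docs.append(doc)
                terms = doc.lower().split()
                description.update(terms)
                vocabulary.extend(terms)      # grow the probe vocabulary from samples
    return description

# Tiny fake peer for illustration:
corpus = ["stream clustering with em", "projected outliers in streams", "fuzzy xpath queries"]
fake_peer = lambda q: [d for d in corpus if q in d]
print(sample_resource_description(fake_peer, ["stream"], probes=10).most_common(3))
```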

Zujie Ren, Lidan Shou, Gang Chen, Chun Chen, Yijun Bei
Analysis of News Agencies’ Descriptive Features of People and Organizations

News agencies report news from different viewpoints and with different writing styles. We propose a method to extract the characteristic descriptions that a news agency writes about people and organizations. To extract the characteristic descriptions of a given person or organization, we analyze words that appear in the same sentence on the basis of their SVO roles. We then extract descriptions that are often used by the news agency but not commonly used by the others. The experimental results show that our method can elucidate the distinctive features of each agency's writing style.

Shin Ishida, Qiang Ma, Masatoshi Yoshikawa
Analyzing Document Retrievability in Patent Retrieval Settings

Most information retrieval settings, such as web search, are typically precision-oriented, i.e. they focus on retrieving a small number of highly relevant documents. However, in specific domains such as patent retrieval or law, recall becomes more important than precision: in these cases the goal is to find all relevant documents, requiring algorithms to be tuned more towards recall at the cost of precision. This raises important questions with respect to retrievability and search engine bias: depending on how the similarity between a query and documents is measured, certain documents may be more or less retrievable in certain systems, up to the point where some documents are not retrievable at all within common threshold settings. Biases may be oriented towards the popularity of documents (increasing the weight of references) or towards the length of documents, may favour the use of rare or common words, or may rely on structural information such as metadata or headings. Existing accessibility measurement techniques are limited in that they measure retrievability with respect to all possible queries. In this paper, we improve accessibility measurement by considering sets of relevant and irrelevant queries for each document. This simulates how recall-oriented users create their queries when searching for relevant information. We evaluate retrievability scores using a corpus of patents from the US Patent and Trademark Office.
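As background, a commonly used cumulative retrievability score credits a document for every query that ranks it within the top c results; the sketch below shows that generic notion (not this paper's query-set-based refinement), and `run_query` is a hypothetical ranking function.

```python
def retrievability(doc_id, queries, run_query, c=100):
    """Count the queries that retrieve doc_id within their top-c results.
    run_query(q) is assumed to return a ranked list of document ids."""
    score = 0
    for q in queries:
        if doc_id in run_query(q)[:c]:
            score += 1
    return score

# Hypothetical toy usage:
index = {"q1": ["d1", "d2"], "q2": ["d2", "d3"], "q3": ["d2"]}
print(retrievability("d2", index.keys(), lambda q: index[q], c=1))   # 2
```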

Shariq Bashir, Andreas Rauber
Classifying Web Pages by Using Knowledge Bases for Entity Retrieval

In this paper, we propose a novel method to classify Web pages by using knowledge bases for entity search, a typical kind of Web search for information related to a person, location or organization. First, we map a Web page to entities according to the similarities between the page and the entities. Various methods for computing this similarity are applied; for example, we can compute the similarity between a given page and a Wikipedia article describing a certain entity, and the frequency with which an entity appears in the page is another factor used in computing the similarity. Second, we construct a directed acyclic graph, named the PEC graph, based on the relations among Web pages, entities, and categories, by referring to YAGO, a knowledge base built on Wikipedia and WordNet. Finally, by analyzing the PEC graph, we classify Web pages into categories. The results of preliminary experiments validate the methods proposed in this paper.

Yusuke Kiritani, Qiang Ma, Masatoshi Yoshikawa
Terminology Extraction from Log Files

The log files generated by digital systems can be used in management information systems as a source of important information on the condition of those systems. However, log files are not exhaustively exploited for information extraction. Classical information extraction methods, such as terminology extraction methods, are not well suited to this context because of the specific characteristics of log files, such as their heterogeneous structure, their special vocabulary, and the fact that they do not respect a natural language grammar. In this paper, we introduce our approach, Exterlog, for extracting terminology from log files, and we detail how it deals with the particularities of such textual data.

Hassan Saneifar, Stéphane Bonniol, Anne Laurent, Pascal Poncelet, Mathieu Roche

Database and Information System Architecture and Performance

Evaluating Non-In-Place Update Techniques for Flash-Based Transaction Processing Systems

Recently, flash memory has been emerging as a mainstream storage device. With prices falling fast, its cost per capacity is approaching that of SATA disk drives. So far, flash memory has been widely deployed in consumer electronics and partly in mobile computing environments; its deployment in enterprise systems has been studied by many researchers and developers. In terms of access performance characteristics, flash memory is quite different from disk drives. Without mechanical components, flash memory has very high random read performance, whereas it has limited random write performance because of its erase-before-write design. The random write performance of flash memory is comparable with, or even worse than, that of disk drives. Due to this performance asymmetry, naive deployment in enterprise systems may not fully exploit the potential performance of flash memory. This paper studies the effectiveness of using non-in-place-update (NIPU) techniques along the I/O path of flash-based transaction processing systems. Our experiments using both an open-source DBMS and a commercial DBMS validated the potential benefits: a 3.0x to 6.6x performance improvement was confirmed by incorporating non-in-place-update techniques into the file system without any modification of applications or storage devices.
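A toy sketch of the non-in-place-update idea only: updates are appended sequentially and reads are redirected through an index, so random overwrites (and thus erase-before-write penalties) are avoided. Real NIPU layers live in the file system or storage engine and also perform garbage collection, which is omitted here; all names are hypothetical.

```python
class AppendOnlyStore:
    def __init__(self):
        self.log = []        # sequential writes only -> flash-friendly
        self.index = {}      # logical page id -> position of latest version

    def write(self, page_id, data):
        self.log.append((page_id, data))          # never overwrite in place
        self.index[page_id] = len(self.log) - 1   # redirect reads to the new version

    def read(self, page_id):
        return self.log[self.index[page_id]][1]

store = AppendOnlyStore()
store.write("p1", b"v1")
store.write("p1", b"v2")    # an update appends; the old version becomes garbage
print(store.read("p1"))     # b'v2'
```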

Yongkun Wang, Kazuo Goda, Masaru Kitsuregawa
A Relational Encoding of a Conceptual Model with Multiple Temporal Dimensions

The theoretical interest and practical relevance of a systematic treatment of multiple temporal dimensions is widely recognized in the database and information system communities. Nevertheless, most relational databases have no temporal support at all, and only a few of them provide limited support in terms of temporal data types and predicates, constructors, and functions for the management of time values (borrowed from the SQL standard). One (resp. two) temporal dimensions are supported only by historical and transaction-time (resp. bitemporal) databases. In this paper, we provide a relational encoding of a conceptual model featuring four temporal dimensions, namely the classical valid and transaction times, plus the event and availability times. We focus our attention on the distinctive technical features of the proposed temporal extension of the relational model. In the last part of the paper, we briefly show how to implement it in a standard DBMS.

Donatella Gubiani, Angelo Montanari
Three Approximation Algorithms for Energy-Efficient Query Dissemination in Sensor Database System

A sensor database is a type of database management system which offers sensor data and stored data in its data model and query languages. In such a system, when a user poses a query to the sensor database, the query is disseminated across the database. During this process, each sensor generates data matching the query from its covered area and then returns the data to the originating sensor. To achieve an energy-efficient implementation, it is useful to select a minimally sufficient subset of sensors to keep active at any given time. Thus, finding such a subset efficiently is an important problem for sensor database systems; we define it as the sensor database coverage (SDC) problem. In this paper, we reduce the SDC problem to the connected set cover problem and then present two approximation algorithms to select a minimum connected set cover for a given sensor database. Moreover, to guarantee robustness and accuracy, we require a fault-tolerant sensor database, which means that each target in a query region must be covered by at least m sensors and the selected sensors must form a k-connected subgraph. We name this problem the (k, m)-SDC problem and design a further approximation algorithm for it. These three algorithms are the first approximation algorithms with guaranteed approximation ratios for the SDC problem. We also provide simulations to evaluate the performance of our algorithms and compare the results with the algorithms in [17]; the comparison demonstrates the efficiency of our approximations. Our algorithms thus offer a new, efficient approach to solving the coverage problem in sensor database systems.
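For intuition, the classical greedy heuristic for set cover is sketched below; note that the paper's algorithms additionally require the selected sensors to form a connected (or k-connected) subgraph, which this toy version ignores, and the example data are invented.

```python
def greedy_cover(targets, sensors):
    """sensors: dict sensor_id -> set of targets it covers.
    Greedily pick the sensor covering the most still-uncovered targets."""
    uncovered, chosen = set(targets), []
    while uncovered:
        best = max(sensors, key=lambda s: len(sensors[s] & uncovered))
        if not sensors[best] & uncovered:
            raise ValueError("remaining targets cannot be covered")
        chosen.append(best)
        uncovered -= sensors[best]
    return chosen

sensors = {"s1": {1, 2, 3}, "s2": {3, 4}, "s3": {4, 5}, "s4": {1, 5}}
print(greedy_cover({1, 2, 3, 4, 5}, sensors))   # e.g. ['s1', 's3']
```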

Zhao Zhang, Xiaofeng Gao, Xuefei Zhang, Weili Wu, Hui Xiong

Query Processing and Optimization IV (Short Papers)

Top-k Answers to Fuzzy XPath Queries

Data heterogeneity in XML repositories can be tackled by giving users the possibility to obtain approximate answers to their queries. In this setting, several approaches for XPath queries have been defined in the literature. In particular, fuzzy XPath queries have been recently introduced as a formalism to provide users with a clear understanding of the approximations that the query evaluation process introduces in the answers. However, in many cases, users are not a-priori aware of the maximum approximation degree they would allow in the answers; rather, they are interested in obtaining the first

k

answers ranked according to their approximation degrees. In this paper we investigate the problem of top-

k

fuzzy XPath querying, propose a query language and its associated semantics, and discuss query evaluation.
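The top-k selection step itself is straightforward once approximation degrees are available; a minimal sketch (not the paper's evaluation algorithm) follows, with invented example data.

```python
import heapq

def top_k(answers, k):
    """answers: iterable of (approximation_degree, node) pairs."""
    return heapq.nlargest(k, answers, key=lambda pair: pair[0])

answers = [(0.9, "a"), (0.4, "b"), (0.75, "c"), (1.0, "d")]
print(top_k(answers, 2))   # [(1.0, 'd'), (0.9, 'a')]
```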

Bettina Fazzinga, Sergio Flesca, Andrea Pugliese
Deciding Query Entailment in Fuzzy Description Logic Knowledge Bases

Existing fuzzy description logic (DL) reasoners are either incapable of answering conjunctive queries or only apply to DLs with less expressivity. In this paper, we present an algorithm for answering expressive fuzzy conjunctive queries, which allows the occurrence of both lower and upper bounds of thresholds in a query atom, over the relatively expressive DL fuzzy $\mathcal{ALCN}$. Our algorithm is specially tailored for deciding conjunctive query entailment of negative role atoms of the form R(x, y) ≤ n or R(x, y) < n, which, to the best of our knowledge, has not been addressed in the literature before.

Jingwei Cheng, Z. M. Ma, Fu Zhang, Xing Wang
An Optimization Technique for Multiple Continuous Multiple Joins over Data Streams

Join queries, which have a heavy cost, are essential to data stream management systems in sensor networks. In this paper, we propose an optimization algorithm for multiple continuous join operators over data streams using a heuristic strategy. First, we propose a solution for building the global shared query execution plan. Second, we solve the problems of updating a window size and routing a join result. Our experimental results show that the proposed method can provide better throughput than previous methods.

Changwoo Byun, Hunjoo Lee, YoungHa Ryu, Seog Park
Top-k Queries with Contextual Fuzzy Preferences

This paper deals with the interpretation of database queries with preference conditions of the form "attribute is low (resp. medium, high)" in the situation where the user is not aware of the actual content of the database but still wants to retrieve the best possible answers (relative to that content). An approach to defining the terms "low", "medium" and "high" in a contextual and relative manner is introduced.
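One simple way to make "low", "medium" and "high" relative to the actual database content is to define memberships over the empirical distribution of the attribute; the sketch below shows such a quantile-based interpretation as an illustration only, since the paper's contextual definition may differ.

```python
def relative_label_membership(value, column_values):
    """Fuzzy membership degrees for 'low', 'medium', 'high', defined relative
    to the actual content of the column via its empirical CDF."""
    ranked = sorted(column_values)
    rank = sum(v <= value for v in ranked) / len(ranked)   # empirical CDF of `value`
    low = max(0.0, 1.0 - 2.0 * rank)       # 1 at the bottom of the range, 0 past the median
    high = max(0.0, 2.0 * rank - 1.0)      # 0 up to the median, 1 at the top
    medium = 1.0 - low - high
    return {"low": low, "medium": medium, "high": high}

prices = [5, 8, 12, 15, 20, 40, 80, 120]
print(relative_label_membership(12, prices))   # {'low': 0.25, 'medium': 0.75, 'high': 0.0}
```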

Patrick Bosc, Olivier Pivert, Amine Mokhtari
Reranking and Classifying Search Results Exhaustively Based on Edit-and-Propagate Operations

Search engines return a huge number of Web search results, and the user usually checks only the top 5 or 10 of them. However, the user sometimes needs to collect information exhaustively, such as gathering all the publications a certain person has written, or collecting useful information to support a purchase decision. In this case, the user must repeatedly wade through search results that are clearly irrelevant. We believe that people would use a search system that provides reranking and classification functions driven by the user's interaction. We have already proposed a reranking system based on the user's edit-and-propagate operations. In this paper, we introduce a drag-and-drop operation into our system to support the user's exhaustive search.

Takehiro Yamamoto, Satoshi Nakamura, Katsumi Tanaka
Backmatter
Metadata
Title: Database and Expert Systems Applications
Edited by: Sourav S. Bhowmick, Josef Küng, Roland Wagner
Copyright year: 2009
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-642-03573-9
Print ISBN: 978-3-642-03572-2
DOI: https://doi.org/10.1007/978-3-642-03573-9
