
2006 | Book

Database and Expert Systems Applications

17th International Conference, DEXA 2006, Kraków, Poland, September 4-8, 2006. Proceedings

Edited by: Stéphane Bressan, Josef Küng, Roland Wagner

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

The annual international conference on Database and Expert Systems Applications (DEXA) is now well established as a reference scientific event. The reader will find in this volume a collection of scientific papers that represent the state of the art of research in the domain of data, information and knowledge management, intelligent systems, and their applications. The 17th conference in the DEXA series was held at the Andrzej Frycz Modrzewski Cracow College in Kraków, Poland, during September 4–8, 2006. Several collocated conferences and workshops covered specialized and complementary topics to the main conference topic. Four conferences, namely the 8th International Conference on Data Warehousing and Knowledge Discovery (DaWaK), the 7th International Conference on Electronic Commerce and Web Technologies (EC-Web), the 5th International Conference on Electronic Government (EGOV), and the Third International Conference on Trust, Privacy, and Security in Digital Business (TrustBus), as well as 14 workshops, were collocated with DEXA. The whole forms a unique international event with a balanced depth and breadth of topics. Its much-appreciated conviviality fosters unmatched opportunities to meet, share the latest scientific results and discuss the latest technological advances in the area of information technologies with both young scientists and engineers and senior world-renowned experts. This volume contains the papers selected for presentation at the conference. Each submitted paper was reviewed by three or four reviewers, members of the Program Committee or external reviewers appointed by members of the Program Committee.

Table of Contents

Frontmatter

XML I

Efficient Processing of Multiple XML Twig Queries

Finding all occurrences of a twig pattern in an XML document is a core operation for XML query processing. The emergence of XML as a common mark-up language for data interchange has spawned great interest in techniques for filtering and content-based routing of XML data. In this paper, we aim to use the state-of-the-art holistic twig join technique to address multiple twig queries in a large-scale XML database. We propose a new twig query technique which is specially tailored to match documents with large numbers of twig pattern queries. We introduce the super-twig to represent multiple twig queries. Based on the super-twig, we design a holistic twig join algorithm, called MTwigStack, to find all matches for multiple twig queries by scanning an XML document only once.
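To make the super-twig idea concrete, here is a minimal, hypothetical sketch (not the authors' MTwigStack implementation) of how several twig patterns can be merged into one shared pattern along common prefixes, so that a single document scan can serve all queries:

```python
# Hypothetical sketch: merge twig-query paths into a shared "super-twig".
class TwigNode:
    def __init__(self, label):
        self.label = label
        self.children = {}      # label -> TwigNode
        self.query_ids = set()  # queries whose pattern passes through this node

def merge_twig(root, path_labels, query_id):
    """Insert one root-to-leaf path of a twig query into the super-twig."""
    node = root
    node.query_ids.add(query_id)
    for label in path_labels:
        node = node.children.setdefault(label, TwigNode(label))
        node.query_ids.add(query_id)
    return root

# Two twig queries sharing the prefix book/author:
super_twig = TwigNode("book")
merge_twig(super_twig, ["author", "name"], query_id=1)
merge_twig(super_twig, ["author", "email"], query_id=2)
# The "book" and "author" nodes are now shared; a holistic join over the
# super-twig can evaluate both queries in one pass over the document.
```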

Huanzhang Liu, Tok Wang Ling, Tian Yu, Ji Wu
Effectively Scoring for XML IR Queries

This paper focuses on relevance scoring for XML IR queries. We propose a novel and effective algorithm for relevance scoring, which considers both structural and semantic information. Experiments show that the algorithm can effectively improve precision and recall for XML information retrieval.

Zhongming Han, Jiajin Le, Beijin Shen
Selectively Storing XML Data in Relations

This paper presents a new framework for users to select relevant data from an xml document and store it in an existing relational database, as opposed to previous approaches that shred the entire xml document into a newly created database of a newly designed schema. The framework is based on a notion of xml2db mappings. An xml2db mapping extends a (possibly recursive) dtd by associating element types with semantic attributes and rules. It extracts either part or all of the data from an xml document, and generates sql updates to increment an existing database using the xml data. We also provide an efficient technique to evaluate xml2db mappings in parallel with sax parsing. These yield a systematic method to store xml data selectively in an existing database.
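As an illustration of evaluating such rules in parallel with sax parsing, here is a minimal sketch. The mapping is a made-up example (a book element type associated with a relation books(title, year)); the real xml2db mappings attach semantic attributes and rules to a possibly recursive dtd:

```python
# Hypothetical sketch: emit SQL updates while SAX-parsing an XML stream.
import xml.sax

class Xml2DbHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.context = []   # stack of currently open element names
        self.text = ""
        self.row = {}

    def startElement(self, name, attrs):
        self.context.append(name)
        self.text = ""

    def characters(self, content):
        self.text += content

    def endElement(self, name):
        self.context.pop()
        if name in ("title", "year") and self.context and self.context[-1] == "book":
            self.row[name] = self.text.strip()
        if name == "book":
            # Rule attached to "book": update the existing relation.
            print("INSERT INTO books (title, year) VALUES (%r, %r);"
                  % (self.row.get("title"), self.row.get("year")))
            self.row = {}

xml.sax.parseString(
    b"<lib><book><title>DEXA 2006</title><year>2006</year></book></lib>",
    Xml2DbHandler())
```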

Wenfei Fan, Lisha Ma

Data and Information I

A Context-Aware Preference Model for Database Querying in an Ambient Intelligent Environment

Users’ preferences have traditionally been exploited in query personalization to better serve their information needs. With the emerging ubiquitous computing technologies, users will be situated in an Ambient Intelligent (AmI) environment, where users’ database access will not occur at a single location in a single context as in traditional stationary desktop computing, but rather span a multitude of contexts like office, home, hotel, plane, etc. To deliver personalized query answering in this environment, the need for context-aware query preferences arises accordingly. In this paper, we propose a knowledge-based context-aware query preference model, which can cater for both pull and push queries. We analyze the requirements and challenges that AmI poses upon such a model and discuss the interpretation of the model in the domain of relational databases. We implement the model on top of a traditional DBMS to demonstrate the applicability and feasibility of our approach.

Arthur H. van Bunningen, Ling Feng, Peter M. G. Apers
Andromeda: Building e-Science Data Integration Tools

This paper presents Andromeda (Astronomical Data Resources Mediation), an XML-based data mediation system that enables transparent access to astronomical data sources. Transparent access is achieved by a global view that expresses the requirements of a community of users (i.e., astronomers) and data integration mechanisms adapted to astronomical data characteristics. Instead of providing an ad hoc mediator, Andromeda can be configured for giving access to different data sources according to user requirements (data types, content, data quality, and provenance). Andromeda can also be adapted when new sources are added or new requirements are specified. Furthermore, in Andromeda the data integration process is done in a distributed manner, taking advantage of the available computing resources and reducing data communication costs.

Víctor Cuevas-Vicenttín, José Luis Zechinelli-Martini, Genoveva Vargas-Solar
Improving Web Retrieval Precision Based on Semantic Relationships and Proximity of Query Keywords

Based on recent studies, the most common queries in Web searches involve one or two keywords. While most Web search engines perform very well for a single-keyword query, their precision is not as good for queries involving two or more keywords. Search results often contain a large number of pages that are only weakly relevant to either of the keywords. One solution is to focus on the proximity of keywords in the search results. Filtering keywords by semantic relationships could also be used. We developed a method to improve the precision of Web retrieval based on the semantic relationships between, and proximity of, keywords for two-keyword queries. We have implemented a system that re-ranks Web search results based on three measures: first-appearance term distance, minimum term distance, and local appearance density. Furthermore, the system enables the user to assign weights to the new rank and the original ranks so that the result can be presented in order of the combined rank. We built a prototype user interface in which the user can dynamically change the weights on the two different ranks. The results of the experiment showed that our method improves the precision of Web search results for two-keyword queries.

Chi Tian, Taro Tezuka, Satoshi Oyama, Keishi Tajima, Katsumi Tanaka

Invited Talk DEXA Conference

From Extreme Programming to Extreme Non-programming: Is It the Right Time for Model Transformation Technologies?

Object-Oriented Methods, Formal Specification Languages, Component-Based Software Production... This is just a very short list of technologies proposed to solve a very old and, at the same time, very well-known problem: how to produce software of quality. Programming has been the key task during the last 40 years, and the results have not been successful yet. This work will explore the need to face a sound software production process from a different perspective: the non-programming perspective, where by non-programming we mainly mean modeling. Instead of talking about Extreme Programming, we will introduce an Extreme Non-Programming (Extreme Modeling-Oriented) approach. We will base our ideas on the intensive work done during the last years, oriented to the objective of generating code from a higher-level system specification, normally represented as a Conceptual Schema. Nowadays, though, the hype around MDA has given a new push to these strategies. New methods propose sound model transformations which cover all the different steps of a sound software production process from an Information Systems Engineering point of view. This must include Organizational Modeling, Requirements Engineering, Conceptual Modeling and Model-Based Code Generation techniques. In this context, it seems that the time of Model Transformation Technologies is finally here...

Óscar Pastor

XML II

Using an Oracle Repository to Accelerate XPath Queries

One of the problems associated with XML databases is the poor performance of XPath queries. Although this has attracted much attention by the research community, solutions are either partial (not providing full XPath functionality) or unable to manage database updates. In this work, we exploit features of Oracle 10g in order to rapidly build indexes that improve the processing times of XML databases. Significantly, we can also support XML database updates as the rebuild time for entire indexes is reasonably fast and thus, provides for flexible update strategies. This paper discusses the process for building the index repository and describes a series of experiments that demonstrate our improved query response times.

Colm Noonan, Cian Durrigan, Mark Roantree
A Relational Nested Interval Encoding Scheme for XML Data

As XML is rapidly becoming the de facto standard for data representation and exchange in the Internet age, there has been a lot of research on how to store and retrieve XML data in relational databases. However, even though XML data is mostly tree-structured, the XML research community has paid little attention to traditional RDBMS-based encoding schemes for tree data. In this paper, we investigate one of these encoding schemes, called Nested Interval, for the storage and retrieval of XML data. In particular, our approach is very robust under XML data updates, including the insertion of new nodes. In fact, existing RDBMS-based XML storage and indexing techniques perform very poorly under XML data updates because the XML data must be re-encoded from scratch for virtually any update. In contrast, the Nested Interval scheme does not require re-encoding all nodes. In this respect, our approach is a viable option for storing and querying update-intensive XML applications.
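For intuition, here is a minimal sketch of interval-based tree encoding; the real-valued intervals and the containment test are illustrative simplifications of the paper's Nested Interval scheme:

```python
# Illustrative sketch: each node gets an interval nested inside its
# parent's, so ancestor/descendant tests are containment checks and a
# new node only needs an unused sub-interval - no re-encoding of others.

def is_descendant(node, anc):
    lo, hi = node
    alo, ahi = anc
    return alo < lo and hi < ahi

root = (0.0, 1.0)
chap1 = (0.1, 0.4)      # nested inside root
sect11 = (0.15, 0.2)    # nested inside chap1

print(is_descendant(sect11, root))   # True
print(is_descendant(sect11, chap1))  # True

# A new sibling of chap1 only needs a fresh sub-interval of root,
# e.g. (0.5, 0.6); all existing labels stay unchanged.
```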

Gap-Joo Na, Sang-Won Lee
A Prototype of a Schema-Based XPath Satisfiability Tester

The satisfiability test of queries can be used in query optimization for avoiding the submission and computation of unsatisfiable queries. Thus, applying the satisfiability test before executing a query can save evaluation time and transportation costs in distributed scenarios. Therefore, we propose a schema-based approach to the satisfiability test of XPath queries, which checks whether or not an XPath query conforms to the constraints in a given schema. If an XPath query does not conform to the constraints given in the schema, the evaluation of the query will return an empty result for every XML document; thus, the XPath query is unsatisfiable. We present a complexity analysis of our approach, which proves that our approach is efficient in typical cases. We present an experimental analysis of our developed prototype, which shows the optimization potential of avoiding the evaluation of unsatisfiable queries.
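A minimal sketch of the idea for the simplest case, child-axis-only paths checked against made-up DTD parent/child constraints (the actual tester handles much richer XPath and schema features):

```python
# Illustrative sketch: reject a child-axis XPath whose steps violate
# the schema's parent/child constraints, without touching any document.

dtd_children = {            # element -> allowed child elements (made up)
    "library": {"book"},
    "book": {"title", "author"},
    "author": {"name"},
}

def satisfiable(path):
    steps = path.strip("/").split("/")
    for parent, child in zip(steps, steps[1:]):
        if child not in dtd_children.get(parent, set()):
            return False
    return True

print(satisfiable("/library/book/title"))    # True
print(satisfiable("/library/author/name"))   # False: author is not a child of library
```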

Jinghua Groppe, Sven Groppe

Data and Information II

Understanding and Enhancing the Folding-In Method in Latent Semantic Indexing

Latent Semantic Indexing (LSI) has been proved effective at capturing the semantic structure of document collections, and it is widely used in content-based text retrieval. However, in many real-world applications dealing with very large document collections, LSI suffers from high computational complexity, which comes from the process of Singular Value Decomposition (SVD). As a result, in practice, the folding-in method is widely used as an approximation to the LSI method. However, the folding-in method is generally implemented "as is", and detailed analysis of its effectiveness and performance is left out; consequently, its performance cannot be guaranteed. In this paper, we first illustrate the underlying principle of the folding-in method from a linear algebra point of view and analyze some existing commonly used techniques. Based on this theoretical analysis, we propose a novel algorithm to guide the implementation of the folding-in method. Our method was justified and evaluated by a series of experiments on various classical IR data sets. The results indicate that our method is effective and has consistent performance over different document collections.
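For reference, the standard folding-in step analyzed here projects a new document vector d into the k-dimensional latent space as S_k^{-1} U_k^T d after a truncated SVD of the term-document matrix, avoiding a full SVD recomputation; a minimal NumPy sketch:

```python
# Standard folding-in: project a new document into the rank-k LSI space.
import numpy as np

A = np.array([[1., 0., 1.],     # term-document matrix (terms x docs)
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk = U[:, :k], np.diag(s[:k])

d_new = np.array([1., 0., 1., 1.])          # a new document over the 4 terms
d_hat = np.linalg.inv(Sk) @ Uk.T @ d_new    # folding-in projection
print(d_hat)   # k-dimensional vector comparable to the columns of V_k
```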

Xiang Wang, Xiaoming Jin
DCF: An Efficient Data Stream Clustering Framework for Streaming Applications

Streaming applications, such as environment monitoring and vehicle location tracking, require handling high volumes of continuously arriving data and sudden fluctuations in these volumes while efficiently supporting multi-dimensional historical queries. Traditional database management systems are inappropriate because they require an excessive number of disk I/Os to continuously update massive data streams. In this paper, we propose DCF (Data Stream Clustering Framework), a novel framework that supports efficient data stream archiving for streaming applications. DCF can reduce a great amount of disk I/O in the storage system by grouping incoming data into clusters and storing these instead of raw data elements. In addition, even when there is a temporary fluctuation in the amount of incoming data, it can stably support storing all incoming raw data by controlling the cluster size. Our experimental results show that our approach significantly reduces the number of disk accesses for both inserting and retrieving data.

Kyungmin Cho, Sungjae Jo, Hyukjae Jang, Su Myeon Kim, Junehwa Song
Analysis of BPEL and High-Level Web Service Orchestration: Bringing Benefits to the Problems of the Business

As a business evolves, the number of systems used to run the business increases. This can be either through continued development of the IT strategy, or through acquisition of other companies with different systems. For any business there is a real requirement to link these systems. Web services and Web service orchestration offer a secure, accessible, scalable and future-proof mechanism for system communication. This linking of disparate systems within a company’s IT strategy contributes towards a key SOA strategy. Generally, a company’s IT strategy is arranged and organised by high-level business managers, not technical IT officers. Because SOA implementations are generally very technical, business personnel are unable to get heavily involved in the process at the present time. This paper presents the concepts behind Service Oriented Architectures and Web service orchestration. A custom software tool for developing SOA strategies via BPEL and Web service orchestration is outlined towards the end of the paper. Based on an established modelling tool, the SOA add-in is capable of producing complex SOA strategies for a company’s IT department. This tool is pitched at a sufficient level for semi-technical business managers to be able to utilise it in an efficient manner. This moves away from the traditional low-level design being carried out by highly skilled IT professionals. We also discuss potential future improvements to the BPEL standard.

Adam Strickland, Dick Whittington, Phil Taylor, Bing Wang

XML III

Rewriting Queries for XML Integration Systems

A data integration system typically creates a target XML schema to represent an application domain and source schemas are mapped to the target schema. A user poses a query over the target schema, and the system rewrites the query into a set of queries over the data sources. Existing algorithms generate a set of static rules based on the target schema and mappings, and rewrite the target query using these rules. We design a flexible and dynamic approach that rewrites XML queries directly based on the mappings between the target and source schemas. Theoretical analysis and experiments on both synthetic and real-world datasets indicate that the proposed approach is efficient and scalable.

Ling Li, Mong Li Lee, Wynne Hsu
A Tale of Two Approaches: Query Performance Study of XML Storage Strategies in Relational Databases

Several recent papers have investigated a relational approach to storing XML data, and there is growing evidence that schema-conscious approaches are a better option than schema-oblivious techniques as far as query performance is concerned. This paper studies three strategies for storing XML documents: one representing the schema-conscious approach (Shared-Inlining) and two representing the schema-oblivious approach (XParent and Sucxent++). We implement and evaluate each approach using benchmark non-recursive XQueries. Our analysis shows the interesting fact that schema-conscious approaches are not always a better option than schema-oblivious approaches! In fact, it is possible for a schema-oblivious approach (Sucxent++) to be faster than a schema-conscious approach (Shared-Inlining) for 55% of the benchmark queries (the highest observed factor being 87.8 times). Sucxent++ also outperforms XParent by up to 1700 times.

Sandeep Prakash, Sourav S. Bhowmick
Visual Specification and Optimization of XQuery Using VXQ

As the popularity of XML increases, the need for querying collections of XML data from various systems becomes imperative. Proposed by the W3C, XQuery is becoming a standard for querying such systems. However, the complexity of XQuery prevents its use by a broad audience. This paper proposes a visual XQuery specification language called VXQ to address this issue. Through intuitive abstractions of XML and XQuery, the proposed system can generate XQueries for users who have little knowledge of the language. We show that our visual language is more expressive than previous proposals. Finally, we extend our proposed visual XQuery to support query rewriting and optimization for multiple XQuery systems.

Ryan H. Choi, Raymond K. Wong, Wei Wang

Data and Information III

MSXD: A Model and a Schema for Concurrent Structures Defined over the Same Textual Data

This work aims at defining a model and a schema for multistructured (denoted MS) textual documents. Our objectives are (1) to describe several independent hierarchical structures over the same textual data (represented by several structured documents), (2) to consider annotations added by the user in each structured document, and (3) to define weak constraints over the concurrent structures and annotations. Our proposal is based on the use of hedges (the foundation of RelaxNG). It is associated with an algebra defined on the structures and annotations of a document in order to specify constraints between them (by means of Allen’s relations).

Emmanuel Bruno, Elisabeth Murisasco
Estimating Aggregate Join Queries over Data Streams Using Discrete Cosine Transform

Data stream processing is required to be an on-line, one-pass, and time and space efficient process. In this paper, we develop a framework for estimating equi-join query size based on the cosine transform. The discrete cosine transform (DCT) is able to provide concise and accurate representations of data distributions. It can also be updated easily in the presence of insertions and deletions. We have performed analyses and experiments to compare the DCT with sketch-based methods. The experimental results show that given the same amount of space, our method yields more accurate estimates than sketch methods most of the time. Experimental results have also confirmed that the cosine series can be updated quickly to cope with the rapid flow of data.
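A minimal sketch of why a DCT synopsis is cheap to maintain: each coefficient of the value-frequency distribution is a sum of cosine terms, one per data item, so an insertion or deletion updates each coefficient with a single cosine evaluation. The domain size and the number of retained coefficients below are illustrative assumptions:

```python
# Illustrative sketch: incrementally maintained DCT synopsis of a
# value-frequency distribution over the domain {0, ..., M-1}.
import math

M = 16            # attribute domain size (assumed)
K = 5             # number of retained low-order coefficients (assumed)
coeff = [0.0] * K

def update(value, weight=+1):
    """weight=+1 for an insertion, -1 for a deletion."""
    for k in range(K):
        coeff[k] += weight * math.cos(math.pi * k * (value + 0.5) / M)

def estimated_frequency(value):
    """Approximate frequency of one domain value from the K coefficients."""
    f = coeff[0] / M
    for k in range(1, K):
        f += (2.0 / M) * coeff[k] * math.cos(math.pi * k * (value + 0.5) / M)
    return f

for v in [3, 3, 3, 9, 9, 12]:
    update(v)
update(12, weight=-1)                        # a deletion is a negative update
print(round(estimated_frequency(3), 2))      # smoothed estimate of count of 3
```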

Zhewei Jiang, Cheng Luo, Wen-Chi Hou, Feng Yan, Qiang Zhu

Datamining and Data Warehouses

Evaluation of a Probabilistic Approach to Classify Incomplete Objects Using Decision Trees

We describe an approach to fill missing values in decision trees during classification. This approach is derived from the ordered attribute trees method, proposed by Lobo and Numao in 2000, which builds a decision tree for each attribute and uses these trees to fill the missing attribute values. Both our approach and theirs are based on the Mutual Information between the attributes and the class. Our method takes the dependence between attributes into account by using the Mutual Information. The result of the classification process is a probability distribution instead of a single class. In this paper, we present tests performed on some real databases using our approach and Quinlan’s method. We analyse the classification results of some instances in test data and finally we discuss some perspectives.

Lamis Hawarah, Ana Simonet, Michel Simonet
Multiway Pruning for Efficient Iceberg Cubing

Effective pruning is essential for efficient iceberg cube computation. Previous studies have focused on exclusive pruning: regions of a search space that do not satisfy some condition are excluded from computation. In this paper we propose inclusive pruning and anti-pruning. With inclusive pruning, necessary conditions that solutions must satisfy are identified, and regions that cannot be reached under such conditions are pruned from computation. With anti-pruning, regions of solutions are identified and pruning is not applied to them. We propose a multiway pruning strategy combining exclusive, inclusive and anti-pruning with bounding aggregate functions in iceberg cube computation. Preliminary experiments demonstrate that the multiway pruning strategy improves the efficiency of iceberg cubing algorithms that use only exclusive pruning.

Xiuzhen Zhang, Pauline Lienhua Chou
Mining and Visualizing Local Experiences from Blog Entries

We describe a way to extract visitors’ experiences from Weblogs (blogs), and a way to mine and visualize the activities of visitors at sightseeing spots. A system using our proposed method mines association rules between locations, time periods, and types of experiences out of blog entries. Association rules between experiences are also extracted. We constructed a local information search system that enables the user to specify a location, a time period, or a type of experience in a search query and find relevant Web content. Results of experiments showed that three proposed refinement algorithms, applied to a conventional text mining method, raise the precision and recall of the extracted rules.

Takeshi Kurashima, Taro Tezuka, Katsumi Tanaka
Mining RDF Metadata for Generalized Association Rules

In this paper, we present a novel frequent generalized pattern mining algorithm, called GP-Close, for mining generalized associations from RDF metadata. To solve the over-generalization problem encountered by existing methods, GP-Close employs the notion of generalization closure for systematic over-generalization reduction. Empirical experiments conducted on real-world RDF data sets show that our method can substantially reduce pattern redundancy and perform much better than the original generalized association rule mining algorithm Cumulate in terms of time efficiency.

Tao Jiang, Ah-Hwee Tan

Database Applications I

Analysing Social Networks Within Bibliographical Data

Finding relationships between authors and thematically similar publications is getting harder and harder due to the mass of information and the rapid growth of the number of scientific workers. The io-port.net portal and the DBLP Computer Science Bibliography, which include more than 2,000,000 and 750,000 publications, respectively, from more than 450,000 authors, are major services used by thousands of computer scientists, providing fundamental support for scientists searching for publications or other scientists in similar communities.

In this paper, we describe a user-friendly interface which plays the central role in searching authors and publications and analysing social networks on the basis of bibliographical data.

After introducing the concept of multi-mode social networks, the DBL-Browser itself and various methods for multi-layered browsing through social networks are described.

Stefan Klink, Patrick Reuther, Alexander Weber, Bernd Walter, Michael Ley
Automating the Choice of Decision Support System Architecture

Due to the widespread use of decision support systems (DSS), development methods are required by software companies. Several concepts and methods have been suggested for decision-making, yet the development of DSS remains complex due to the variety of architectures and the number and type of modules which compose them.

Therefore, this paper presents our method of DSS development, which is based on an architecture of one to four levels, focusing on the choice of an adapted architecture (composed of data warehouses and data marts). Our method is based on a mixed approach integrating the assessment of existing sources and user requirements. These are taken into account from the early steps of DSS engineering, i.e., from the analysis step, so that the choice of DSS architecture integrates them.

Estella Annoni, Franck Ravat, Olivier Teste, Gilles Zurfluh
Dynamic Range Query in Spatial Network Environments

Moving range queries over mobile objects are important in many location management applications, and there have been quite a few research works in this area. However, all existing solutions assume an open space environment and are either not applicable to spatial network environments or require non-trivial extensions. In this paper, we consider a new class of query called the Dynamic Range Query. A dynamic range query is a moving range query in a network environment, which retrieves the moving objects within a specified network distance of the moving query point. As this query point moves in the network, the footprint (or shape) of the query range changes accordingly to reflect the new relevant query area. Our execution strategy leverages the computing power of the moving objects to reduce server load and communication costs. This scheme is particularly desirable for many practical applications such as vehicles in a street environment, where mobile energy is not an issue. We describe the design details and present our simulation study. The performance results indicate that our solution is almost two orders of magnitude better than a query index method in terms of server load, and requires a similar number of messages when compared to a query-blind optimal scheme.

Fuyu Liu, Tai T. Do, Kien A. Hua
Context and Semantic Composition of Web Services

Composition of Web services is a cornerstone step in the development of interoperable systems. However, Web services still face data-heterogeneity challenges, despite several attempts at using semantics. In addition, the context in which Web services evolve is still somewhat "ignored", hampering their adaptability to changes in composition situations. In this paper, we argue that context makes it possible to determine the semantics of the interfaces that Web services expose to third parties. We show the need for a context- and semantic-based approach to Web services composition.

Michael Mrissa, Chirine Ghedira, Djamal Benslimane, Zakaria Maamar

XML IV

An Efficient Yet Secure XML Access Control Enforcement by Safe and Correct Query Modification

This work proposes an efficient yet secure XML access control enforcement mechanism specifically designed to support fine-grained security policies. Based on metadata in the DTD, we propose the SQ-Filter, a pre-processing method that checks the necessary access control rules and rewrites a user’s query by extending/eliminating query tree nodes and by injecting operators that combine a set of nodes from the user’s query point of view. The scheme has several advantages over other suggested schemes, including small execution time overhead and safe, correct query modification. The experimental results clearly demonstrate the efficiency of the approach.

Changwoo Byun, Seog Park
Detecting Information Leakage in Updating XML Documents of Fine-Grained Access Control

To provide fine-grained access control to data in an XML document, XML access control policy is defined based on the contents and structure of the document. In this paper, we discuss confidential information leakage problem caused by unsecure-update that modifies contents or structures of the document referred by the access control policy. In order to solve this problem, we propose an algorithm that computes update constraints of a user on some data in the document under access control policy of the user. We also propose an algorithm that decides whether a given update request of a user against an XML document is an unsecure-update under the user’s access control policy.

Somchai Chatvichienchai, Mizuho Iwaihara
Faster Twig Pattern Matching Using Extended Dewey ID

Finding all the occurrences of a twig pattern in an XML database is a core operation for efficient evaluation of XML queries. Recently, Lu et al. [7] proposed the TJFast algorithm, which uses the extended Dewey labelling scheme and reported better performance than other state-of-the-art holistic twig join algorithms in terms of both the number of elements scanned and the number stored during the computation. In this paper, we design an enhancement to further exploit the power of the extended Dewey ID. This reduces the CPU cost and also favors indexed inputs. Our algorithm can be shown analytically to be as efficient as TJFast in terms of worst-case I/O, and experimentally it performs significantly better.
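For background, a minimal sketch of the basic Dewey labelling property that such algorithms build on: a node's label encodes the labels of all its ancestors, so ancestor/descendant relationships are decided from labels alone. (Extended Dewey additionally recovers ancestor element names from the label via a schema-derived transducer, which is not shown here.)

```python
# Basic Dewey labels: a node's label is its parent's label plus one component.

def ancestors(dewey):
    """All proper ancestor labels of a Dewey label like '1.3.2.5'."""
    parts = dewey.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts))]

print(ancestors("1.3.2.5"))   # ['1', '1.3', '1.3.2']

def is_ancestor(a, d):
    """Ancestor test is a simple prefix check on the labels."""
    return d.startswith(a + ".")

print(is_ancestor("1.3", "1.3.2.5"))  # True
```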

Chung Keung Poon, Leo Yuen

Data and Information IV

A Vector Space Model for Semantic Similarity Calculation and OWL Ontology Alignment

Ontology alignment (or matching) is the operation that takes two ontologies and produces a set of semantic correspondences (usually semantic similarities) between some elements of one and some elements of the other. A rigorous, efficient and scalable similarity measure is a pre-requisite of an ontology alignment process. This paper presents a semantic similarity measure based on a matrix representation of nodes from an RDF labelled directed graph. An entity is described with respect to how it relates to other entities using N-dimensional vectors, where N is the number of selected external predicates. We adapt a known graph matching algorithm when applying this idea to the alignment of two ontologies. We have successfully tested the model with the public test cases of the Ontology Alignment Evaluation Initiative 2005.

Rubén Tous, Jaime Delgado
Scalable Automated Service Composition Using a Compact Directory Digest

The composition of services that are indexed in a large-scale service directory often involves many complex queries issued by the service composition algorithm to the directory. These queries may cause considerable processing effort within the directory, thus limiting scalability. In this paper we present a novel approach to increase scalability: The directory offers a compact digest that service composition clients download to solve the hard part of a composition problem locally. In this paper we compare two different digest representations, a sparse matrix and a Zero-Suppressed Reduced Ordered Binary Decision Diagram (ZDD). Our evaluation confirms that both representations are compact and shows that the ZDD enables more efficient service composition.

Walter Binder, Ion Constantinescu, Boi Faltings
Topic Structure Mining for Document Sets Using Graph-Based Analysis

This paper proposes a novel text mining method for a document set based on graph-based analysis. Graph-based analysis first identifies the similarity links in the document set and then determines core documents, those that have the highest level of centrality. Each core document represents a different topic. Next, the centrality scores are used together with the graph structure to identify those documents that are associated with the core documents. This process results in a predetermined number of topics. For each topic the user is presented with a set of documents in a three-layer structure: the core document, supplemental documents (those that are strongly associated with the core document), and subtopic documents (those that are only slightly associated with the core document and supplemental documents). The user can select any of the topics and browse the documents related to that topic. Furthermore, the user can select documents according to the layer; for example, subtopic documents are assumed to contain information that differs from the indicated topic and so might be interesting. In analyses of a set of newspaper articles, we evaluate the accuracy of topic identification and the accuracy of collecting documents related to the topics. Furthermore, we show an example of document set visualization based on graph structure and centrality scores; the results indicate the method’s usefulness for browsing and analyzing document sets.

Hiroyuki Toda, Ryoji Kataoka, Hiroyuki Kitagawa

XML V

An Approach for XML Inference Control Based on RDF

In this paper, we present a new approach to XML inference control, built on improvements to an RDF-based access control model. Using concepts derived from XML, such as XML types and XML objects, we encapsulate the nodes of an XML document to represent the semantic relations among them. We also present a method for document combination based on XML keys, which can maintain structural and content consistency between history files and original documents. Since the range of inference control is enlarged and the granularity of authorized objects is expanded, our approach can provide higher security and flexibility for XML documents.

Li Zhuan, Wang Yuanzhen
Recursive SQL Query Optimization with k-Iteration Lookahead

Relational implementation of recursive queries is part of the ANSI SQL99 and was implemented in Teradata V2R6. Recursive queries allow processing of hierarchical data like air flight schedules, bill-of-materials, data cube dimension hierarchies, and ancestor-descendant information (e.g. XML data stored in relations). The initial Teradata recursive query implementation is based on a static (fixed) execution plan for all recursive iterations. This may not be optimal since the intermediate results from recursive iterations vary in size. To address this problem, this paper proposes dynamic re-optimization techniques to produce execution plans that are optimal for all recursive iterations. The approach employs a mix of multi-iteration pre-planning and dynamic feedback techniques that are generally applicable to any recursive query implementation in an RDBMS. We validate our proposed techniques by conducting experiments on a prototype implementation using airline flights data.
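For readers unfamiliar with ANSI SQL99 recursion, here is a minimal sketch of the kind of query being optimized, run on SQLite for illustration (the table and city names are made up). Note that the intermediate result joined at each iteration can vary in size, which is what motivates dynamic re-optimization:

```python
# Illustrative recursive query over flight data, executed via SQLite.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE flights (src TEXT, dst TEXT)")
con.executemany("INSERT INTO flights VALUES (?, ?)",
                [("LAX", "ORD"), ("ORD", "JFK"), ("JFK", "LHR")])

# Each recursive iteration joins the previous result with flights.
reachable = con.execute("""
    WITH RECURSIVE reach(city, hops) AS (
        SELECT dst, 1 FROM flights WHERE src = 'LAX'
        UNION
        SELECT f.dst, r.hops + 1
        FROM reach r JOIN flights f ON f.src = r.city
    )
    SELECT city, hops FROM reach
""").fetchall()
print(reachable)   # [('ORD', 1), ('JFK', 2), ('LHR', 3)]
```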

Ahmad Ghazal, Alain Crolotte, Dawit Seid
An Effective, Efficient XML Data Broadcasting Method in a Mobile Wireless Network

With the rapidly increasing popularity of XML, an effective method to transmit XML data over wireless broadcasting environments is urgently required. In this paper, a novel XML data streaming method for wireless broadcasting environments is proposed. An XML stream is organized to enable a selective access scheme for simple XPath queries by borrowing the path summary technique, which was originally devised for indexing semi-structured data. In order to utilize structure information as an XML stream index, the structure information and text values of an XML document are separated. In experimental results, the proposed method demonstrates superior performance over previous approaches with regard to both access and tuning time.

Sang-Hyun Park, Jae-Ho Choi, SangKeun Lee

Data and Information V

Formalizing Mappings for OWL Spatiotemporal Ontologies

Ontology mappings provide a common layer which allows distributed applications to share and exchange semantic information. Providing mechanized ways of mapping ontologies is a challenging issue, and the main problems to be faced are related to structural and semantic heterogeneity. The complexity of these problems increases in the presence of spatiotemporal information such as geometry and intrinsic topological characteristics. Our proposal is intended for spatiotemporal ontologies and focuses on providing integrated access to information sources using local ontologies. Our approach is set to build a system that guides users to derive meaningful mappings and to reason about them. To achieve this we use a description logic extended with a spatiotemporal concrete domain. The ontology of each source is normalized in a common extended Ontology Web Language (OWL), which enables a natural correspondence with the spatiotemporal description logic formalism.

Nacéra Bennacer
Multi-term Web Query Expansion Using WordNet

In this paper, we propose a method for multi-term query expansion based on WordNet. In our approach, Hypernym/Hyponym and Synonym relations in WordNet are used as the basic expansion rules. We then use WordNet lexical chains and WordNet semantic similarity to assign the terms in a query to different groups with respect to their semantic similarities. For each group, we expand the terms highest in the WordNet hierarchies with Hypernym and Synonym relations, the lowest terms with Hyponym and Synonym relations, and all other terms with only Synonym relations. Furthermore, we use a collection-related term semantic network to remove low-frequency and unusual words from the expansions. Our experiments reveal that our solution for query expansion can improve query performance dramatically.
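A minimal sketch of the basic expansion rules, assuming NLTK's WordNet interface (the grouping by lexical chains and the collection-based filtering are not shown). Requires `pip install nltk` and `nltk.download('wordnet')`:

```python
# Illustrative sketch: expand a term with synonyms plus either hypernyms
# or hyponyms, depending on its position in the WordNet hierarchy.
from nltk.corpus import wordnet as wn

def expand(term, direction):
    """direction: 'up' -> hypernyms + synonyms, 'down' -> hyponyms + synonyms."""
    expansions = set()
    for synset in wn.synsets(term):
        expansions.update(l.name() for l in synset.lemmas())   # synonyms
        related = synset.hypernyms() if direction == "up" else synset.hyponyms()
        for rel in related:
            expansions.update(l.name() for l in rel.lemmas())
    expansions.discard(term)
    return expansions

# Treat "java" as the lowest term of a query like "java programming":
print(sorted(expand("java", "down"))[:10])
```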

Zhiguo Gong, Chan Wa Cheang, Leong Hou U
Fast Computation of Database Operations Using Content-Addressable Memories

Research efforts on conventional CPU architectures over the past decade have focused primarily on performance enhancement. In contrast, the NPU (Network Processing Unit) architectures have evolved significantly in terms of functionality. The memory hierarchy of a typical network router features a Content-Addressable Memory (CAM) which provides very fast constant-time lookups over large amounts of data and facilitates a wide range of novel high-speed networking solutions such as Packet Classification, Intrusion Detection and Pattern Matching. While these networking applications span an entirely different domain than the database applications, they share a common operation of searching for a particular data entry among huge amounts of data. In this paper, we investigate how CAM-based technology can help in addressing the existing memory hierarchy bottlenecks in database operations. We present several high-speed CAM-based solutions for computationally intensive database operations. In particular, we discuss an efficient linear-time complexity CAM-based sorting algorithm and apply it to develop a fast solution for complex join operations widely used in database applications.

Nagender Bandi, Divyakant Agrawal, Amr El Abbadi
CLEAR: An Efficient Context and Location-Based Dynamic Replication Scheme for Mobile-P2P Networks

We propose CLEAR (Context and Location-based Efficient Allocation of Replicas), a dynamic replica allocation scheme for improving data availability in mobile ad-hoc peer-to-peer (M-P2P) networks. To manage replica allocation efficiently, CLEAR exploits user mobility patterns and deploys a super-peer architecture, which avoids both broadcast storm during replica allocation as well as broadcast-based querying. CLEAR considers different levels of replica consistency and load as replica allocation criteria. Our performance study indicates CLEAR’s overall effectiveness in improving data availability in M-P2P networks.

Anirban Mondal, Sanjay Kumar Madria, Masaru Kitsuregawa

Datamining and Data Warehouses

Lossless Reduction of Datacubes

Datacubes are especially useful for efficiently answering queries on data warehouses. Nevertheless, the amount of generated aggregated data is incomparably more voluminous than the initial data, which is itself very large. Recently, research work has addressed the issue of concise representations of datacubes in order to reduce their size. The approach presented in this paper fits in a similar trend. We propose a concise representation, called the Partition Cube, based on the concept of partition, and define an algorithm to compute it. Various experiments are performed in order to compare our approach with methods in the same trend. This comparison relates to the efficiency of the algorithms computing the representations, the main memory requirements, and the necessary storage space.

Alain Casali, Rosine Cicchetti, Lotfi Lakhal, Noël Novelli
Multivariate Stream Data Classification Using Simple Text Classifiers

We introduce a classification framework for continuous multivariate stream data. The proposed approach works in two steps. In the preprocessing step, it takes as input a sliding window of multivariate stream data and discretizes the data in the window into a string of symbols that characterize the signal changes. In the classification step, it uses a simple text classification algorithm to classify the discretized data in the window. We evaluated both supervised and unsupervised classification algorithms. For supervised, we tested Naïve Bayes Model and SVM, and for unsupervised, we tested Jaccard, TFIDF, Jaro and JaroWinkler. In our experiments, SVM and TFIDF outperformed the other classification methods. In particular, we observed that classification accuracy is improved when the correlation of attributes is also considered along with the n-gram tokens of symbols.
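A minimal sketch of the preprocessing step under illustrative assumptions (a three-symbol alphabet for up/down/steady changes and an arbitrary threshold; the paper's actual discretization may differ):

```python
# Illustrative sketch: discretize a sliding window of one attribute into
# a symbol string, then tokenize it into n-grams for a text classifier.

def discretize(window, eps=0.1):
    symbols = []
    for prev, cur in zip(window, window[1:]):
        delta = cur - prev
        symbols.append("u" if delta > eps else "d" if delta < -eps else "s")
    return "".join(symbols)

def ngrams(s, n=3):
    return [s[i:i + n] for i in range(len(s) - n + 1)]

window = [1.0, 1.3, 1.6, 1.6, 1.2, 0.9]
text = discretize(window)        # 'uusdd'
print(text, ngrams(text))        # tokens would feed Naive Bayes / SVM / TFIDF
```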

Sungbo Seo, Jaewoo Kang, Dongwon Lee, Keun Ho Ryu
Location-Based Service with Context Data for a Restaurant Recommendation

Utilizing Global Positioning System (GPS) technology, it is possible to find and recommend restaurants for users operating mobile devices. For recommending restaurants, Personal Digital Assistants or cellular phones only consider the location of restaurants. However, a user’s background and environment information is assumed to be directly related to recommendation quality. In this paper, therefore, a recommender system using context information and a decision tree model for efficient recommendation is presented. This system considers location context, personal context, environment context, and user preference. Restaurant lists are obtained from location context, personal context, and environment context using the decision tree model. In addition, a weight value is used for reflecting user preferences. Finally, the system recommends appropriate restaurants to the mobile user. For this experiment, performance was verified using measurements such as k-fold cross-validation and Mean Absolute Error. As a result, the proposed system obtained an improvement in recommendation performance.

Bae-Hee Lee, Heung-Nam Kim, Jin-Guk Jung, Geun-Sik Jo
Cascaded Star: A Hyper-Dimensional Model for a Data Warehouse

A data warehouse is defined as a subject-oriented, integrated, time-variant and nonvolatile collection of data. Often, the data representing different subjects is multi-dimensional in nature, where each dimension of each subject could again be multi-dimensional. We refer to this as the hyper-dimensional nature of data. Traditional multi-dimensional data models (e.g., the star schema) cannot adequately model such data, because a star schema models one single multi-dimensional subject; hence a complex query crossing different subjects at different dimensional levels has to be specified as multiple queries whose results must be composed together manually. In this paper, we present a novel data model, called the cascaded star model, to model hyper-dimensional data, and propose cascaded OLAP (COLAP) operations that enable ad-hoc specification of queries that encompass multiple stars. Specifically, our COLAP operations include cascaded-roll-up, cascaded-drill-down, cascaded-slice, cascaded-dice and MCUBE. We show that COLAP can be represented by the relational algebra, to demonstrate that the cascaded star can be built on top of the traditional star schema framework.

Songmei Yu, Vijayalakshmi Atluri, Nabil Adam

Database Applications II

Using JDOSecure to Introduce Role-Based Permissions to Java Data Objects-Based Applications

The Java Data Objects specification is designed as a lightweight persistence approach. Thus, JDO supports neither user authentication nor role-based authorization. Consequently, users are able to query the entire data store as well as delete persistent objects without any restriction. The novel security approach JDOSecure was developed at the University of Mannheim to prevent unauthorized access to the data store while using the JDO API. Based on the dynamic proxy approach, JDOSecure introduces role-based permissions to JDO-based applications. In this paper we focus on how JDOSecure enables Java Data Objects-based applications to deal with role-based permissions.

Matthias Merz, Markus Aleksy
A Forced Transplant Algorithm for Dynamic R-tree Implementation

Spatial access methods play a crucial role in spatial database management and manipulation. The R-tree and its variations have been widely accepted as some of the most efficient spatial indexing structures in recent years. However, they do not consider storage utilization or the global optimization of the R-tree structure. Presented in this paper is a new optimization technique named the forced transplant algorithm, which can improve node storage utilization and optimize the overall R-tree structure at the same time. Our experiments show that an R-tree with our new optimization technique achieves almost 100% storage utilization and excellent query performance for all types of data.

Mingbo Zhang, Feng Lu, Changxiu Cheng
An Approach for a Personal Information Management System for Photos of a Lifetime by Exploiting Semantics

Photo collections are one of the most promising sources for telling the story of a life in this digital era. In this paper we present our work on organizing photos of a lifetime by exploiting semantic annotations. The complexity of using semantic technology is managed by introducing an annotation template corresponding to who, when, where, and what. The semantics of each dimension are semi-automatically annotated following an ontology for personal information. Storytelling is done by exploiting these semantics to form trails of the photos. The notion of landmarks is used for this purpose, which also ensures effective navigation in the lifetime photo-space.

Khalid Latif, Khabib Mustofa, A. Min Tjoa
Topic Distillation in Desktop Search

Desktop Search, the search across local storage such as a personal computer, is a common practice among computer users. There has been much activity in Web-related Information Retrieval, but Desktop Search has only recently increased in popularity. As the structure and accessibility of data in a local environment is different to the Web, new algorithmic possibilities arise for Desktop Search.

We apply a connectivity analysis approach to the local environment—a filesystem. We describe how it can be used in parallel with existing tools to provide “more useful” ranked results. Our evaluation reveals that such an approach has promise, and we conclude that exploiting the organization of a filesystem is beneficial for Desktop Search.

Alex Penev, Matthew Gebski, Raymond K. Wong

WWW I

Interactions Between Document Representation and Feature Selection in Text Categorization

Many studies in automated Text Categorization focus on the performance of classifiers, with or without considering feature selection methods, but almost as a rule taking into account just one document representation. Only relatively recently did detailed studies on the impact of various document representations step into the spotlight, showing that there may be statistically significant differences in classifier performance even among variations of the classical bag-of-words model. This paper examines the relationship between the idf transform and several widely used feature selection methods, in the context of Naïve Bayes and Support Vector Machines classifiers, on datasets extracted from the dmoz ontology of Web-page descriptions. The described experimental study shows that the idf transform considerably affects the distribution of classification performance over feature selection reduction rates, and offers an evaluation method which permits the discovery of relationships between different document representations and feature selection methods, independent of absolute differences in classification performance.
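For reference, the idf transform in question scales a term's in-document frequency by log(N/df(t)), where N is the number of documents and df(t) is the number of documents containing term t; a minimal sketch on a toy collection:

```python
# Toy tf-idf computation: terms occurring in every document get weight 0.
import math

docs = [["xml", "query", "twig"],
        ["xml", "index"],
        ["query", "optimizer", "xml"]]
N = len(docs)

df = {}                                  # document frequency of each term
for d in docs:
    for t in set(d):
        df[t] = df.get(t, 0) + 1

def tfidf(doc):
    return {t: doc.count(t) * math.log(N / df[t]) for t in set(doc)}

print(tfidf(docs[0]))   # 'xml' gets weight 0: it occurs in all documents
```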

Miloš Radovanović, Mirjana Ivanović
WebDriving: Web Browsing Based on a Driving Metaphor for Improved Children’s e-Learning

A novel approach to Web browsing called "WebDriving" is described that automatically extracts information from the Web and places it in a 3D space, enabling children to easily view the information while "driving through the 3D world". The user can visualize not only the target Web page, but also the "peripheral information" (that is, linked and other related pages) in the same 3D world. This makes the user aware of other relevant Web pages while browsing the target Web page. Our WebDriving browser is well suited for theme-based "investigative learning", which is now being promoted at many elementary schools in Japan.

Mika Nakaoka, Taro Tezuka, Katsumi Tanaka
Semantic Wikis for Personal Knowledge Management

Wikis are becoming popular knowledge management tools. Analysing knowledge management requirements, we observe that wikis do not fully support structured search and knowledge reuse. We show how Semantic wikis address the requirements and present a general architecture. We introduce our SemperWiki prototype which offers advanced information access and knowledge reuse.

Eyal Oren, Max Völkel, John G. Breslin, Stefan Decker

Bioinformatics

Integration of Protein Data Sources Through PO

Resolving heterogeneity among various protein data sources is a crucial problem if we want to gain more information about the proteomics process. Information from multiple protein databases, like PDB, SCOP, and UniProt, needs to be integrated to answer user queries. Issues of semantic heterogeneity have not been addressed so far in protein informatics. This paper outlines a protein data source composition approach based on our existing work on Protein Ontology (PO). The proposed approach enables semi-automatic interoperation among heterogeneous protein data sources. The establishment of semantic interoperation over the conceptual framework of PO enables us to gain better insight into how information can be integrated systematically and how queries can be composed. The semantic interoperation between protein data sources is based on semantic relationships between concepts of PO. No other such generalized semantic protein data interoperation framework has been considered so far.

Amandeep S. Sidhu, Tharam S. Dillon, Elizabeth Chang
3D Protein Structure Matching by Patch Signatures

For determining functionality dependencies between two proteins, both represented as 3D structures, it is an essential condition that they have one or more matching structural regions called patches. As 3D structures for proteins are large, complex and constantly evolving, it is computationally expensive and very time-consuming to identify possible locations and sizes of patches for a given protein against a large protein database. In this paper, we address a vector space based representation for protein structures, where a patch is formed by the vectors within a region. Based on our previous work, a compact representation of the patch named patch signature is applied here. A similarity measure of two patches is then derived based on their signatures. To achieve fast patch matching in large protein databases, a match-and-expand strategy is proposed. Given a query patch, a set of small k-sized matching patches, called candidate patches, is generated in the match stage. The candidate patches are further filtered by enlarging k in the expand stage. Our extensive experimental results demonstrate encouraging performance with respect to this biologically critical but previously computationally prohibitive problem.

Zi Huang, Xiaofang Zhou, Heng Tao Shen, Dawei Song

WWW II

Segmented Document Classification: Problem and Solution

In recent years, structured text documents like XML files have been playing an important role in Web-based applications. Among them are documents segmented into different sections like "title", "body", etc.; we call them "segmented documents". To classify segmented documents, we can treat them as bags of words and use well-developed text classification models. However, different sections in a segmented document may have different impacts on the classification result, so it is better to treat them differently in the classification process. Following this idea, two algorithms, IN_MIX and OUT_MIX, are designed to label segmented documents using a trained classifier. We evaluate our algorithms with four frequently used models: SVM, Naive Bayes, Regression and Instance-based classifiers. According to experiments on Reuters-21578, the performance of the different classification models is improved compared to the conventional bag-of-words method.

Hang Guo, Lizhu Zhou
User Preference Modeling Based on Interest and Impressions for News Portal Site Systems

We have developed an application called My Portal Viewer (MPV) [1] that effectively integrates articles collected from multiple news sites and presents these integrations through a familiar interface, such as a page the user has often visited. MPV dynamically determines keywords that a user might potentially be interested in, based on the history of the articles the user has read, and creates categories based on these interest words. MPV and many other similar integration systems, however, have the problem that users cannot easily find the articles of interest in each category, because articles are ranked only by the frequency and co-occurrence of keywords. We propose a new method of selecting further articles from each category using a user’s impressions of articles. The improved MPV, called MPV Plus, selects and recommends more desirable articles using the proposed method. This paper presents the design concept and process flow of MPV Plus and reports on its effectiveness as evaluated in experiments.

Yukiko Kawai, Tadahiko Kumamoto, Katsumi Tanaka
Cleaning Web Pages for Effective Web Content Mining

Classifying and mining noise-free web pages improves the accuracy of search results as well as search speed, and may benefit web-page organization applications (e.g., keyword-based search engines and taxonomic web page categorization applications). Noise on web pages is content irrelevant to the main content of the pages being mined, and includes advertisements, navigation bars, and copyright notices. The few existing works on web page cleaning detect noise blocks with exactly matching contents but are weak at detecting near-duplicate blocks, characterized by items like navigation bars.

This paper proposes a system, WebPageCleaner, for eliminating noise blocks from web pages in order to improve the accuracy and efficiency of web content mining. A vision-based technique is employed for extracting blocks from web pages. Relevant web page blocks are then identified as those with a high importance level by analyzing such physical features of the blocks as the block location, the percentage of web links on the block, and the level of similarity of block contents to other blocks. Important blocks are exported for web content mining using Naive Bayes text classification. Experiments show that WebPageCleaner leads to more accurate and efficient web page classification than comparable existing approaches.

Jing Li, C. I. Ezeife
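
The abstract names the physical features that drive block importance but not the exact scoring function; the following is a hypothetical linear combination over those features, purely for illustration:

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / max(len(a | b), 1)

    def block_importance(block, other_blocks):
        """Score one extracted block; the weights are assumptions,
        not WebPageCleaner's actual formula."""
        link_ratio = block["num_link_words"] / max(block["num_words"], 1)
        # Near-duplicate blocks (e.g. navigation bars repeated across
        # pages) earn a penalty via similarity to the other blocks.
        avg_sim = (sum(jaccard(block["words"], b["words"]) for b in other_blocks)
                   / max(len(other_blocks), 1))
        central = 1.0 if block["location"] == "center" else 0.0
        return 2.0 * central - 1.5 * link_ratio - 1.0 * avg_sim
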

Process Automation and Workflow

An Applied Optimization Framework for Distributed Air Transportation Environments

In a large-scale dynamic system with multiple distributed entities, each with its own set of interests, there is a need to find a globally acceptable and optimal solution state. This solution state is, by definition, efficient for all entities with respect to their own individual goals and for the system as a whole. In these dynamic environments, this solution state can be achieved by utilizing software techniques from the field of game theory in order to make optimal decisions. We present an application built upon a generalized optimization framework that can be applied to a number of domains, such as cargo or network traffic algorithms. In this research, we used a market-based approach to air traffic flow management through a modeling and simulation environment. The aim is to allow individual aircraft a certain degree of local autonomy, much like cars on a highway. Our system is able to cope in real time with failures such as node loss and to adjust system parameters accordingly to optimize results based on the goals of the involved agents. We describe tradeoffs between different agent interaction frameworks with respect to their performance in market mechanism auctions. We also discuss lessons learned while implementing this application. This research builds upon our previously reported work [20, 21] on route optimizations and airspace sector design in an air traffic control network by adding the goals of interested entities, e.g., airlines, aircraft, and airports, maximizing the “payoff” to each player (agent). It is intended that the results of our work will be used directly in this domain. In addition, we envision our work being leveraged for other optimization tasks such as data traffic on a network, first responder / disaster relief efforts, and other tasks where rapid solving of large-scale optimization problems is essential.

Thomas Castelli, Joshua Lee, Waseem Naqvi
On the Completion of Workflows

Workflow Management Systems (WFMS) coordinate the execution of logically related multiple tasks in an organization. A workflow schema is defined using a set of tasks that are coordinated using dependencies. Workflows instantiated from the same schema may differ with respect to the tasks executed. An important issue that must be addressed while designing a workflow is to decide which tasks are needed for the workflow to complete – we refer to this set as the completion set. Since different tasks are executed in different workflow instances, a workflow schema may be associated with multiple completion sets. Incorrect specification of completion sets may prevent some workflows from completing. Manually generating these sets for large workflow schemas can be an error-prone and tedious process. Our goal is to automate this process. We investigate the factors that affect the completion of a workflow. Specifically, we study the impact of control-flow dependencies on completion sets and show how this knowledge can be used for automatically generating these sets. Finally, we provide an algorithm that can be used by application developers to generate the completion sets associated with a workflow schema.

Tai Xin, Indrakshi Ray, Parvathi Chundi, Sopak Chaichana
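
As a sketch of how completion sets can be enumerated, assume a block-structured schema built from sequence, parallel (AND) and exclusive-choice (XOR) blocks; the paper's algorithm handles general control-flow dependencies, which this toy recursion does not:

    from itertools import product

    def completion_sets(node):
        """node: ("task", name) or ("seq"|"and"|"xor", [children]).
        Returns every set of tasks with which an instance can complete."""
        kind, payload = node
        if kind == "task":
            return [frozenset([payload])]
        child_sets = [completion_sets(c) for c in payload]
        if kind in ("seq", "and"):      # every branch must complete
            return [frozenset().union(*combo) for combo in product(*child_sets)]
        if kind == "xor":               # exactly one branch completes
            return [s for sets in child_sets for s in sets]
        raise ValueError(kind)

    # A ; (B xor C) completes with {A, B} or {A, C}.
    wf = ("seq", [("task", "A"), ("xor", [("task", "B"), ("task", "C")])])
    print(completion_sets(wf))
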
Concurrency Management in Transactional Web Services Coordination

The Business Process Execution Language (BPEL4WS) has emerged to introduce the process dimension into Web Services coordination. At the same time, many needs related to business process management have appeared. In this article we focus on transactional management in Web Services platforms. The WS-Transaction specification has had a great impact on the usage of Web Services in critical situations such as financial services. This usage of transactions in Web Services coordination also introduced concurrency problems similar to those encountered in the world of transactional databases, due to hard transactional constraints, especially for isolation mechanisms. Today, WS-Transaction provides flexible atomicity in Web Services coordination (WS-BusinessActivity), but isolation flexibility is not provided. The isolation mechanisms used today are not really adapted to service-oriented environments, and we aim to make them more ‘process friendly’. In this paper, we focus on this important part of the concurrency problem and propose a new view of WS-Transaction based on the Behavioural Spheres approach. This contribution suggests a reorganisation of the WS-Coordination framework, adding WS-IsolationSphere for isolation management and the WS-Sphere coordination type for generalising any behaviour management in Web Services coordination.

Adnene Guabtni, François Charoy, Claude Godart
Acquisition of Process Descriptions from Surgical Interventions

The recording and analysis of process descriptions from running surgical interventions is a very new and promising field named Surgical Workflows. Surgical Workflows fulfill two major objectives: they form the basis of scientific evaluation and rapid prototyping of surgical assist systems, and they pave the way for workflow management systems to enter the operating room for intraoperative support of the surgeon. In this paper we describe how process descriptions from surgical interventions can be obtained for Surgical Process Modelling (SPM) as a specific domain of Business Process Modelling (BPM). After an introduction to the field of Surgical Workflows and the motivation of the research efforts, we deal with theoretical considerations about surgical interventions and the identification of classifications. Based on that, we propose an extensible structure for computational data acquisition support and conclude with use cases. The presented approach was applied to more than 200 surgical interventions of 10 different intervention types from otorhinolaryngology, neurosurgery, heart surgery, eye surgery, and interventional radiology, and it represents an ongoing project.

Thomas Neumuth, Gero Strauß, Jürgen Meixensberger, Heinz U Lemke, Oliver Burgert

Knowledge Management and Expert Systems

Adaptive Policies in Information Lifecycle Management

Adaptive policies contain parameters and take performance feedback into consideration to modify the values of these parameters adaptively. We propose a method (applicable to the domain of Information Lifecycle Management) to adapt these parameters automatically. Design issues such as the selection of sensors, desired ranges for sensors, and effects of parameter sensitivity such as over-correction and under-correction are discussed.

Rohit M. Lotlikar, Mukesh Mohania
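
A minimal sketch of such an adaptation loop (the sensor, its desired range, the damping factor, and the assumption that raising the parameter raises the sensor are all illustrative, not the paper's design):

    def adapt_parameter(param, sensor_value, lo, hi, gain=0.1, damping=0.5):
        """Nudge a policy parameter so a performance sensor stays in
        [lo, hi]; damping shrinks each correction to avoid the
        over-correction the abstract warns about."""
        if lo <= sensor_value <= hi:
            return param                    # already in the desired range
        error = (lo - sensor_value) if sensor_value < lo else (hi - sensor_value)
        return param + damping * gain * error

    # e.g. a data-migration threshold steered by an observed hit rate
    threshold = 0.8
    threshold = adapt_parameter(threshold, sensor_value=0.55, lo=0.6, hi=0.9)
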
Implementation and Experimentation of the Logic Language NP Datalog

This paper presents a system prototype implementing NP Datalog, a Datalog-like language for expressing NP search and optimization problems. NP Datalog extends DATALOG¬s (DATALOG with stratified negation) with intuitive and efficient constructs, i.e., constraints and a restricted form of (exclusive) disjunction used to define (nondeterministically) subsets (or partitions) of relations. The system translates NP Datalog queries into OPL programs, then solves them by using the ILOG Solver [16]. Thus, it combines an easy formulation of problems, expressed by means of a declarative logic language, with the efficient execution of the ILOG Solver. Several experiments show the effectiveness of this approach.

S. Greco, C. Molinaro, I. Trubitsyna
Converting Naive Bayes Models with Multi-valued Domains into Sets of Rules

Nowadays, several knowledge representation methods are used in knowledge-based systems, machine learning, and data mining. Among them are decision rules and Bayesian networks. Both methods have specific advantages and disadvantages. A conversion method would allow one to exploit the advantages of both techniques. In this paper, an algorithm that converts Naive Bayes models with multi-valued attribute domains into sets of rules is proposed. Experimental results show that it is possible to generate rule-based classifiers which have relatively high accuracy and are simpler than the original models.

Bartłomiej Śnieżyński
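
A brute-force baseline for such a conversion, shown only to make the idea concrete (it enumerates every attribute-value combination and emits the Naive Bayes argmax as the rule's conclusion; the paper's algorithm produces fewer and simpler rules):

    from itertools import product
    import math

    def nb_to_rules(domains, log_prior, log_likelihood):
        """domains: {attr: [values]}; log_likelihood[c][(attr, value)]
        from a trained multi-valued Naive Bayes model."""
        attrs = sorted(domains)
        rules = []
        for combo in product(*(domains[a] for a in attrs)):
            best = max(log_prior, key=lambda c: log_prior[c] + sum(
                log_likelihood[c].get((a, v), math.log(1e-6))
                for a, v in zip(attrs, combo)))
            rules.append((dict(zip(attrs, combo)), best))
        return rules    # [(condition dict, predicted class), ...]
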
Hypersphere Indexer

Indexing high-dimensional data for efficient nearest-neighbor searches poses interesting research challenges. It is well known that when data dimension is high, the search time can exceed the time required for performing a linear scan on the entire dataset. To alleviate this dimensionality curse, indexing schemes such as locality sensitive hashing (LSH) and M-trees were proposed to perform approximate searches. In this paper, we propose a hypersphere indexer, named Hydex, to perform such searches. Hydex partitions the data space using concentric hyperspheres. By exploiting geometric properties, Hydex can perform effective pruning. Our empirical study shows that Hydex enjoys three advantages over competing schemes for achieving the same level of search accuracy. First, Hydex requires fewer seek operations. Second, Hydex can maintain sequential disk accesses most of the time. And third, it requires fewer distance computations.

Navneet Panda, Edward Y. Chang, Arun Qamra
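
The pruning rests on the triangle inequality: a point whose distance to the common center lies outside [d(q,c) - r, d(q,c) + r] cannot be within r of the query q. A minimal in-memory illustration of that idea (Hydex's actual disk layout and nearest-neighbor logic are not reproduced here):

    import math

    def dist(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def build_rings(points, center, width):
        """Bucket points into concentric rings by distance to center."""
        rings = {}
        for p in points:
            rings.setdefault(int(dist(p, center) / width), []).append(p)
        return rings

    def range_search(rings, center, width, q, r):
        dq = dist(q, center)
        lo = int(max(dq - r, 0.0) / width)
        hi = int((dq + r) / width)
        hits = []
        for i in range(lo, hi + 1):     # every other ring is pruned
            hits += [p for p in rings.get(i, []) if dist(p, q) <= r]
        return hits
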

Database Theory I

Distributed Continuous Range Query Processing on Moving Objects

Recent work on continuous queries has focused on processing queries in very large, mobile environments. In this paper, we propose a system leveraging the computing capacities of mobile devices for continuous range query processing. In our design, continuous range queries are mainly processed on the mobile device side, which makes it possible to achieve real-time updates with minimum server load. Our work distinguishes itself from previous work with several important contributions. First, we introduce a distributed server infrastructure that partitions the entire service region into a set of service zones and cooperatively handles requests for continuous range queries. This feature improves the robustness and flexibility of the system by adapting to a time-varying set of servers. Second, we propose a novel query indexing structure which records the difference of the query distribution on a grid model. This approach significantly reduces the size and complexity of the index, so that in-memory indexing can be achieved on mobile objects with constrained memory size. We report on a rigorous evaluation of our design, which shows substantial improvement in the efficiency of continuous range query processing in mobile environments.

Haojun Wang, Roger Zimmermann, Wei-Shinn Ku
Optimal Route Determination Technology Based on Trajectory Querying Moving Object Database

Recently, location-based services (LBS), which make use of the location information of moving objects, have attracted increasing attention. We can store and manage the trajectories of moving objects such as vehicles by using a moving object database management system. Therefore, past trajectory information may be used to deliver better services to users. In this paper, we describe a novel method that determines an optimal route by querying the trajectories of moving objects stored in a moving object database system. In general, conventional route determination algorithms are not well suited to the real world because they make use of only static properties of road segments. In contrast, our approach can determine the optimal route using dynamic attributes, such as the passing time and real speed of road segments, extracted from the trajectories of moving objects. Thus, we can provide a better route than conventional route determination methods, without requiring traffic information gathering devices.

Kyoung-Wook Min, Ju-Wan Kim, Jong-Hyun Park
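
A sketch of the core idea, assuming per-segment travel times have already been extracted from the stored trajectories (the trajectory query itself depends on the moving object DBMS and is not shown):

    import heapq

    def dynamic_weights(passings):
        """passings: {(u, v): [observed travel times]} mined from
        trajectories; the mean observed time replaces a static
        segment length as the edge weight."""
        return {e: sum(ts) / len(ts) for e, ts in passings.items()}

    def best_route(weights, src, dst):
        """Plain Dijkstra over the dynamically weighted road graph."""
        graph = {}
        for (u, v), w in weights.items():
            graph.setdefault(u, []).append((v, w))
        pq, seen = [(0.0, src, [src])], set()
        while pq:
            cost, u, path = heapq.heappop(pq)
            if u == dst:
                return cost, path
            if u in seen:
                continue
            seen.add(u)
            for v, w in graph.get(u, []):
                if v not in seen:
                    heapq.heappush(pq, (cost + w, v, path + [v]))
        return float("inf"), []
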
Efficient Temporal Coalescing Query Support in Relational Database Systems

The interest in, and user demand for, temporal databases have only increased with time; unfortunately, DBMS vendors and standards groups have not moved aggressively to extend their systems with support for transaction time or valid time. This can be partially attributed to the major expected R&D costs of adding temporal support to an RDBMS by directly extending the database engine. The newly introduced SQL:2003 standards have actually significantly enhanced our ability to support temporal applications in commercial database systems. The long-recognized problem of coalescing, which is difficult to support in the framework of SQL:1992, can now be effectively supported in an RDBMS. In this paper, we investigate alternatives for temporal coalescing queries under temporal data models in an RDBMS. We provide an SQL:2003-based query algorithm and a native relational user-defined aggregate (UDA) approach – both approaches require only a single scan of the database. We conclude that temporal queries can be best supported by the OLAP functions of the current SQL:2003 standards. These new findings demonstrate that current RDBMS are mature enough to directly support efficient temporal queries, and provide a new paradigm for temporal database research and implementation.

Xin Zhou, Fusheng Wang, Carlo Zaniolo
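
Coalescing merges value-equivalent tuples whose validity periods overlap or meet. The single-scan idea can be sketched as follows (plain Python over pre-sorted rows; the paper realizes the same pass with SQL:2003 OLAP functions or a UDA):

    def coalesce(rows):
        """rows: (key, start, end) sorted by (key, start), with
        end-exclusive periods; one pass merges periods that overlap
        or are adjacent."""
        out, cur = [], None
        for key, s, e in rows:
            if cur and cur[0] == key and s <= cur[2]:   # overlaps or meets
                cur = (key, cur[1], max(cur[2], e))
            else:
                if cur:
                    out.append(cur)
                cur = (key, s, e)
        if cur:
            out.append(cur)
        return out

    print(coalesce([("e1", 1, 5), ("e1", 5, 9), ("e1", 12, 14)]))
    # [('e1', 1, 9), ('e1', 12, 14)]
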

Query Processing I

Efficient Evaluation of Partially-Dimensional Range Queries Using Adaptive R*-tree

This paper is about how to efficiently evaluate partially-dimensional range queries, which are often used in many practical applications. If existing multidimensional indices are employed to evaluate partially-dimensional range queries, a great deal of information that is irrelevant to the queries also has to be read from disk. A modification of the R*-tree is described in this paper to ameliorate this situation. Discussions and experiments indicate that the proposed modification can clearly improve the performance of partially-dimensional range queries, especially for large datasets.

Yaokai Feng, Akifumi Makinouchi
Parallelizing Progressive Computation for Skyline Queries in Multi-disk Environment

Given a set of d-dimensional points, a skyline query returns the points that are not dominated by any other point on all dimensions. In this paper, we study an interesting scenario of skyline retrieval, where multi-dimensional points are distributed among multiple disks. Efficient algorithms for parallelizing progressive skyline computation are developed, using parallel R-trees. The core of our scheme is to visit more entries from several disks simultaneously and to enable effective pruning strategies with dominance checking to prune away the non-qualifying entries. Extensive experiments with synthetic data confirm that our proposed algorithms are both efficient and scalable.

Yunjun Gao, Gencai Chen, Ling Chen, Chun Chen
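
For reference, the dominance test at the heart of any skyline method, with a quadratic baseline (minimization convention assumed; the paper's contribution is the parallel, R-tree-driven computation, not this test):

    def dominates(p, q):
        """p dominates q: no worse everywhere, strictly better somewhere."""
        return (all(a <= b for a, b in zip(p, q))
                and any(a < b for a, b in zip(p, q)))

    def skyline(points):
        return [p for p in points
                if not any(dominates(q, p) for q in points if q != p)]

    print(skyline([(1, 4), (2, 2), (3, 1), (3, 3)]))   # drops (3, 3)
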
Parameterizing a Genetic Optimizer

Genetic programming has been proposed as a possible, although still intriguing, approach to query optimization. Two main aspects are still unclear and need further investigation, namely the quality of the results and the speed of convergence to an optimum solution. In this paper we tackle the first aspect, and present and validate a statistical model that, for the first time in the literature, lets us state that the average cost of the best query execution plan (QEP) obtained by a genetic optimizer is predictable. It also allows us to analyze the parameters that matter most for obtaining the best possible QEP cost. As a consequence of this analysis, we demonstrate that it is possible to extract general rules for parameterizing a genetic optimizer independently of the random effects of the initial population.

Victor Muntés-Mulero, Marta Pérez-Casany, Josep Aguilar-Saborit, Calisto Zuzarte, Josep-Ll. Larriba-Pey

Database Theory II

Interpolating and Using Most Likely Trajectories in Moving-Objects Databases

In recent years, many emerging database applications deal with large sets of continuously moving data objects. Since no computer system can commit continuously occurring infinitesimal changes to the database, related data management techniques view a moving object’s trajectory as a sequence of discretely reported spatiotemporal points. For each pair of consecutive committed trajectory points, a spatiotemporal uncertainty region representing all possible in-between trajectory points is defined. To support trajectory queries with a non-uniform probability distribution model, the query system needs to compute (interpolate) the “most likely” trajectories in the uncertainty regions to determine the peak points of the probability distributions. This paper proposes a generalized trajectory interpolation model using parametric trajectory representations. In addition, the paper expands and investigates three practical specializations of our proposed model using a moving object with momentum, i.e., a vehicle, as the exemplar.

Byunggu Yu, Seon Ho Kim
Relaxing Constraints on GeoPQL Operators to Improve Query Answering

In this paper the problem of matching a query with imprecise or missing data is analyzed for geographic information systems, and an approach for relaxing query constraints is proposed. This approach, similarly to the 9-intersection matrix between two sets of points presented in [1] [2] [3], represents two symbolic graphical objects (SGO) in terms of interior, boundary, and exterior points, applied to each of the configurations between two objects. The paper distinguishes the different configurations by considering the results obtained from the 9 intersections, not only by whether results are null or not null. This allows a more detailed similarity graph to be defined. The aim of the proposed methodology is to relax geographical query constraints in order to obtain meaningful answers for imprecise or missing data. The polyline-polygon case is discussed in detail.

Arianna D’Ulizia, Fernando Ferri, Patrizia Grifoni, Maurizio Rafanelli
High-Dimensional Similarity Search Using Data-Sensitive Space Partitioning

Nearest neighbor search has a wide variety of applications. Unfortunately, the majority of search methods do not scale well with dimensionality. Recent efforts have been focused on finding better approximate solutions that improve the locality of data using dimensionality reduction. However, it is possible to preserve the locality of data and find exact nearest neighbors in high dimensions without dimensionality reduction. This paper introduces a novel high-performance technique to find exact k-nearest neighbors in both low- and high-dimensional spaces. It relies on a new method for data-sensitive space partitioning based on explicit data clustering, which is introduced in the paper for the first time. This organization supports effective reduction of the search space before accessing secondary storage. Costly Euclidean distance calculations are reduced through efficient processing of a lightweight memory-based filter. The algorithm outperforms sequential scan and the VA-File in high-dimensional situations. Moreover, results with dynamic loading of data show that the technique works well on dynamic datasets as well.

Sachin Kulkarni, Ratko Orlandic

Query Processing II

Truly Adaptive Optimization: The Basic Ideas

A new approach to query optimization, truly adaptive optimization (TAO), is presented. TAO is a general optimization strategy and is composed of three elements:

1. a fast solution space search algorithm, derived from A*, which uses an informed heuristic lookahead;

2. a relaxation technique which allows the user to specify a tolerance on the quality of the resulting query execution plan;

3. a paradigm to prove the suboptimality of search subspaces. Non-procedural pruning rules can be used to describe specific problem knowledge, and can be easily added to the optimizer, as the specific problem becomes better understood.

The main contributions over previous research are the use of relaxation techniques and the fact that TAO provides a unifying framework for query optimization problems, which models a complexity continuum going from fast heuristic searches to exponential optimal searches while guaranteeing a selected plan quality. In addition, problem knowledge can be exploited to speed up the search. As a preliminary example, the method is applied to query optimization for databases distributed over a broadcast network. Simulation results are reported.

Giovanni Maria Sacco
Applying Cosine Series to XML Structural Join Size Estimation

As XML has become the de facto standard for data presentation and exchange on the Web, XML query optimization has emerged as an important research issue. It is widely accepted that structural joins, which evaluate the containment (ancestor-descendant) relationships between XML elements, are important to XML query processing. Estimating structural join sizes accurately and quickly is thus crucial to the success of XML query plan selection. In this paper, we propose to apply the Cosine transform to structural join size estimation. Our approach captures structural information of XML data using mathematical functions, which are then approximated by Cosine series. We derive a simple formula to estimate the structural join size using the Cosine series. Theoretical analyses and extensive experiments have been performed. The experimental results show that, compared with the state-of-the-art IM-DA-Est method, our method is several orders of magnitude faster, requires less memory, and yields better or comparable estimates.

Cheng Luo, Zhewei Jiang, Wen-Chi Hou, Qiang Zhu, Chih-Fang Wang
On the Query Evaluation in Document DBs

In this paper, we study query evaluation in document databases. First, we show that a query represented in an XML language can generally be considered as a labeled tree, and the evaluation of such a query is in fact a tree embedding problem. Then, we propose a strategy to solve this problem, based on dynamic programming. For the ordered tree embedding, the proposed algorithm needs only O(|T|·|P|) time and O(|T|·|P|) space, where |T| and |P| stand for the numbers of nodes in the target tree T and the pattern tree P, respectively. This computational complexity is better than that of any existing method on this issue. In addition, how to adapt this method to the general tree embedding is also discussed.

Yangjun Chen
A Novel Incremental Maintenance Algorithm of SkyCube

Skyline query processing has recently received a lot of attention in the database community. Reference [1] considers the problem of efficiently computing a SkyCube, which consists of the skylines of all possible non-empty subsets of a given set of dimensions. However, the SkyCube cannot be used further once the original data set is changed. In this paper, we propose a novel incremental maintenance algorithm for the SkyCube, called IMASCIR. IMASCIR splits the maintenance work into two phases: identify and refresh. All the materialized SkyCube views share two tables which store the net change to the views due to the change to the original data set. In the identify phase, we identify and store the source changes in these shared tables. Then, in the refresh phase, each materialized view is refreshed individually by applying these two shared tables. Furthermore, our experiments demonstrate that IMASCIR is both efficient and effective.

Zhenhua Huang, Wei Wang

Database Theory III

Probabilistic Replication Based on Access Frequencies in Unstructured Peer-to-Peer Networks

Recently, there has been increasing interest in research on data sharing through peer-to-peer networks. In this paper, assuming a data-sharing service, we propose a new replication strategy that achieves not only load balancing but also improved search performance. The proposed strategy creates replicas at each peer on the path along which a query is successfully forwarded, depending on the peer’s rank of access frequency to the data. Moreover, we verify the effectiveness of the proposed strategy by simulation experiments.

Takahiro Hara, Yuki Kido, Shojiro Nishio
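
A sketch of rank-driven path replication (the exact probability rule is the paper's design choice; the reciprocal-rank probability below is only an assumed stand-in):

    import random

    def replicate_on_path(path_peers, data_id, access_rank):
        """path_peers: peers on the path of a successfully answered
        query. access_rank[(peer, data_id)] is the peer's access-
        frequency rank for the data (1 = most frequent). Peers that
        access the data often replicate it with higher probability."""
        return [p for p in path_peers
                if random.random() < 1.0 / access_rank.get((p, data_id), 10)]
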
Role-Based Serializability for Distributed Object Systems

In the role-based access control model, a role is a set of access rights. A subject doing jobs is granted roles representing those jobs in an enterprise. A transaction issued by a subject is associated with a subset of the roles granted to the subject, which is named the purpose. A method with a more significant purpose is performed before another method with a less significant purpose. We discuss which purposes are more significant than others. We discuss two types of role-ordering (RO) schedulers, SRO and PRO, where multiple conflicting transactions are serializable in the significant order of subjects and purposes, respectively. We evaluate the RO schedulers against the traditional two-phase locking protocol in terms of throughput.

Youhei Tanaka, Tomoya Enokido, Makoto Takizawa
MDSSF – A Federated Architecture for Product Procurement

A new database federation architecture called MDSSF (Multiple Database Search Service Federation) is presented to support the procurement activities of AEC (Architecture, Engineering and Construction) industry projects. In order to make procurement decisions, a contractor requires access to product information from several different product suppliers when constructing artefacts such as a hospital or an office block. This product information is available from the online systems of product suppliers. However, this approach requires a contractor to visit several websites in order to find the right product, which is time-consuming, and the product data available from different product suppliers is heterogeneous. The MDSSF architecture provides an integrated means of accessing product information from a large number of product suppliers using a single system. It brings together autonomous product suppliers to share product information with federation users such as contractors and potential buyers using a common data model. It also creates an environment for product suppliers to compete with each other in a virtual marketplace based on the product information they provide to federation users. The MDSSF gives its users a Grid-enabled database search mechanism for searching a large number of supplier databases in real time and protects product-related sensitive data from exposure to business competitors. We describe the architecture and distinctive features of the MDSSF.

Jaspreet Singh Pahwa, Pete Burnap, W. A. Gray, John Miles

Knowledge Management and Expert Systems

Argumentation for Decision Support

In this paper we describe an application based on a general approach towards modelling practical reasoning through defeasible argumentation. The purpose of the paper is to show how the incorporation of an argumentation component can add value to a collection of existing information agents. The example application is a system for reasoning about the medical treatment of a patient. An agent, called the Drama agent, orchestrates a number of information sources to supply a set of arguments on the basis of which the decision regarding treatment can be taken. We describe the general approach and its instantiation for this application, and illustrate the operation of the system with a running example.

Katie Atkinson, Trevor Bench-Capon, Sanjay Modgil
Personalized Detection of Fresh Content and Temporal Annotation for Improved Page Revisiting

Page revisiting is a popular browsing activity on the Web. In this paper we describe a method for improving page revisiting by detecting and highlighting the information on browsed Web pages that is fresh for a user. Content freshness is determined by comparison with previously viewed versions of pages. Any content new to the user is marked, enabling the user to quickly spot it. We also describe a mechanism for visually informing users about the degree of freshness of linked pages. By indicating the freshness level of content on linked pages, the system enables users to navigate the Web more effectively. Finally, we propose and demonstrate the concept of determining the user-dependent, subjective age of page contents. Using this method, elements of Web pages are annotated with dates indicating the first time the elements were accessed by the user.

Adam Jatowt, Yukiko Kawai, Katsumi Tanaka
Clustering of Search Engine Keywords Using Access Logs

It has become possible for users to obtain many kinds of information by simply inputting search keyword(s) representing a topic of interest. However, it is not always easy for users to come up with proper search keyword(s). In this paper, by using Web access logs (called panel logs), which are the collected URL histories of Japanese users (called panels) selected without statistical bias, similar to surveys of TV audience ratings, we study methods of clustering search keywords. Unlike existing systems, where related search keywords are extracted based on the set of URLs viewed by users after inputting their original search keyword(s), we propose two novel methods of clustering search keywords. One is based on Web communities (sets of similar web pages); the other is based on the sets of nouns obtained by morphological analysis of Web pages. According to the evaluation results, our proposed methods can extract more related search keywords than the URL-based approach.

Shingo Otsuka, Masaru Kitsuregawa

Database Theory IV

Non-metric Similarity Ranking for Image Retrieval

For many years, almost all research work in content-based image retrieval has used the Minkowski distance (or Lp-norm) to measure similarity between images. However, such functions cannot adequately capture the characteristics of the human visual system. In this paper, we present a new similarity measure reflecting the nonlinearity of human perception. Based on this measure, we develop a similarity ranking algorithm for effective image retrieval. This algorithm exploits the inherent cluster structure revealed by an image dataset. Our method yields encouraging experimental results on a real image database and demonstrates its effectiveness.

Guang-Ho Cha
An Effective Method for Approximating the Euclidean Distance in High-Dimensional Space

It is crucial to compute the Euclidean distance between two vectors efficiently in high-dimensional space for multimedia information retrieval. We propose an effective method for approximating the Euclidean distance between two high-dimensional vectors. For this approximation, a previous method, which simply employs the norms of the two vectors, has been proposed. This method, however, ignores the angle between the two vectors in the approximation, and thus suffers from large approximation errors. Our method introduces an additional vector called a reference vector for estimating the angle between the two vectors, and approximates the Euclidean distance accurately by using the estimated angle. This reduces the approximation errors significantly compared with the previous method. Also, we formally prove that the value approximated by our method is always smaller than the actual Euclidean distance. This implies that our method does not incur any false dismissals in multimedia information retrieval. Finally, we verify the superiority of the proposed method via performance evaluation with extensive experiments.

Seungdo Jeong, Sang-Wook Kim, Kidong Kim, Byung-Uk Choi
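
The lower-bound property follows from the angular triangle inequality: the angle between x and y is at least |ang(x,r) - ang(y,r)|, and substituting that smaller angle into the law of cosines can only shrink the computed distance. A sketch (the paper's estimator precomputes norms and angles at indexing time; names here are illustrative):

    import math

    def angle_to(v, r):
        dot = sum(a * b for a, b in zip(v, r))
        nv = math.sqrt(sum(a * a for a in v))
        nr = math.sqrt(sum(a * a for a in r))
        return math.acos(max(-1.0, min(1.0, dot / (nv * nr))))

    def approx_distance(x, y, r):
        """Never exceeds the true Euclidean distance ||x - y||, so a
        filter based on it cannot cause false dismissals."""
        nx = math.sqrt(sum(a * a for a in x))
        ny = math.sqrt(sum(a * a for a in y))
        t = abs(angle_to(x, r) - angle_to(y, r))
        return math.sqrt(max(nx * nx + ny * ny - 2 * nx * ny * math.cos(t), 0.0))
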
Dynamic Method Materialization: A Framework for Optimizing Data Access Via Methods

This paper addresses the problem of selecting the right methods for materialization. In our approach, the selection of methods is based on method execution statistics that are constantly gathered in our prototype system. Based on these statistics, the prototype selects methods whose materialization is profitable. The proposed mechanism was extensively evaluated in experiments whose results are presented in this paper.

Robert Wrembel, Mariusz Masewicz, Krzysztof Jankiewicz

Privacy and Security

Towards an Anti-inference (K, ℓ)-Anonymity Model with Value Association Rules

As a privacy-preserving microdata publication model, K-anonymity has some application limits: (1) it cannot satisfy the individually-defined k mechanism requirement, and (2) it carries a certain potential privacy disclosure risk on published microdata, i.e., there exist high-probability inference violations under some prior knowledge on k-anonymized microdata that can result in personal private information disclosure. We propose the (k, ℓ)-anonymity model with a data generalization approach to support more flexible and anti-inference k-anonymization of tabular microdata, where k indicates the anonymization level of an identifying attribute cluster and ℓ refers to the diversity level of a sensitive attribute cluster on a record. Within the model, k and ℓ are designed on each record and can be defined subjectively by the corresponding individual. Besides, the model can prevent two kinds of inference attacks on microdata publication: (1) inferring identifying attribute values when their value domains are known; (2) inferring sensitive attribute values with respect to some value associations in the microdata. Further, we propose an algorithm to describe the k-anonymization process in the model. Finally, we present a scenario to illustrate its feasibility, flexibility, and generality.

Zude Li, Guoqiang Zhan, Xiaojun Ye
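
A minimal sketch of verifying per-record k and ℓ requirements on a generalized table (the attribute roles and field names are assumptions; the paper's generalization algorithm itself is not shown):

    from collections import defaultdict

    def satisfies_k_l(records, qid_attrs, sensitive_attr):
        """records: dicts carrying the individual's own 'k' and 'l'.
        A record's group is all records sharing its (generalized)
        quasi-identifier values; the group must hold at least k
        records and at least l distinct sensitive values."""
        groups = defaultdict(list)
        for rec in records:
            groups[tuple(rec[a] for a in qid_attrs)].append(rec)
        for grp in groups.values():
            distinct = len({r[sensitive_attr] for r in grp})
            for r in grp:
                if len(grp) < r["k"] or distinct < r["l"]:
                    return False
        return True
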
Analysis of the Power Consumption of Secure Communication in Wireless Networks

With the growth of the Internet, communication and network security have been the focus of much attention. In addition, the deployment of resource-intensive security protocols in battery-powered mobile devices has made power consumption a significant basis of network design. In this paper, we propose a power-efficient secure communication restart mechanism for a wireless network and analyze the power consumed while restarting a secure communication. An experimental test bed was developed to inspect the proposed mechanism and to evaluate it in terms of power consumption relative to that of conventional secure communication restart mechanisms. Using our enhanced mechanism, we were able to reduce the power consumed during a secure communication restart by up to 60% compared with conventional restart mechanisms.

Kihong Kim, Jinkeun Hong, Jongin Lim
Implementing Authorization Delegations Using Graph

Graph-based approaches to access control models have been studied by researchers due to their visualization, flexible representation, and precise semantics. In this paper, we present a detailed graph-based algorithm to evaluate authorization delegations and resolve conflicts based on the shorter-weighted-path-takes-precedence method. The approach makes it possible for administrators to control their granting of authorizations in a very flexible way. A correctness proof and the time complexity of the algorithm are provided. We then consider how the authorization state can be changed, since in a dynamic environment an authorization state is not static. A detailed algorithm for state transformation and its correctness proof are also given.

Chun Ruan, Vijay Varadharajan
Modeling and Inferring on Role-Based Access Control Policies Using Data Dependencies

Role-Based Access Control (RBAC) models are becoming a de facto standard, greatly simplifying management and administration tasks. Organizational constraints were introduced (e.g., mutually exclusive roles, cardinality, prerequisite roles) to reflect the peculiarities of organizations. Thus, the number of rules is increasing and policies are becoming more and more complex: understanding and analyzing large policies in which several security officers are involved can be a tough job. There is a serious need for administration tools allowing analysis of and inference on access control policies. Such tools should help security officers avoid defining conflicting constraints and inconsistent policies.

This paper shows that theoretical tools from relational databases are suitable for expressing and inferring on RBAC policies and their related constraints. We focus on Constrained Tuple-Generating Dependencies (CTGDs), a class of dependencies which includes other traditional ones. We show that their great expressive power is suitable for all practically relevant aspects of RBAC. Moreover, proof procedures have been developed for CTGDs: they permit reasoning on policies, for example, to check their consistency, to verify that a new rule is not already implied, or to check the satisfaction of security properties. A prototype RBAC policy management tool has been implemented, using dedicated CTGD proof procedures as the underlying inference engine.

Romuald Thion, Stéphane Coulondre

Database Theory V

Multi-dimensional Dynamic Bucket Index Based on Mobile Agent System Architecture

This paper describes the idea of a bucket index employed for processing intensive reading streams coming from huge telemetry networks. This data structure answers approximate spatio-temporal range queries concerning utility usage in a user-selected region and time. The index structure continuously adjusts to changes in the data distribution and, as opposed to traditional indexing methods, is capable of processing the updates on the fly. A stochastic prediction model is also used to estimate utility usage in the near future. The presented indexing technique is implemented in a distributed system based on mobile agents. The mobile architecture is used to control the workload of network hosts.

Marcin Gorawski, Adam Dyga
An Incremental Refining Spatial Join Algorithm for Estimating Query Results in GIS

Geographic information systems (GIS) must support large georeferenced data sets. Due to the size of these data sets, finding exact answers to spatial queries can be very time consuming. We present an incremental refining spatial join algorithm that can be used to report query result estimates while simultaneously providing incrementally refined confidence intervals for these estimates. Our approach allows for more interactive data exploration. While similar work has been done in relational databases, to the best of our knowledge this is the first work using this approach in GIS. We investigate different sampling methodologies and evaluate them through extensive experimental performance comparisons. Experiments on real and synthetic data show an order of magnitude response time improvement relative to obtaining the exact answer using the R-tree join.

Wan D. Bae, Shayma Alkobaisi, Scott T. Leutenegger
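
The estimate-and-refine loop can be illustrated with simple random sampling of candidate pairs and a standard normal-approximation interval (a toy stand-in for the paper's R-tree-driven sampling methodologies):

    import math
    import random

    def refine_join_estimate(pairs, joins, batch=1000, z=1.96):
        """pairs: candidate object pairs; joins(a, b) tests the spatial
        predicate. Yields (estimate, half_width) after each batch, so
        the 95% confidence interval tightens as sampling proceeds."""
        order = list(range(len(pairs)))
        random.shuffle(order)
        hits = 0
        for n, i in enumerate(order, start=1):
            hits += bool(joins(*pairs[i]))
            if n % batch == 0 or n == len(pairs):
                p = hits / n
                half = z * math.sqrt(max(p * (1 - p), 1e-12) / n) * len(pairs)
                yield p * len(pairs), half
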
Extensions to Stream Processing Architecture for Supporting Event Processing

Both event and stream data processing models have been researched independently and are utilized in diverse application domains. Although they complement each other in terms of their functionality, there is a critical need for their synergistic integration to serve a newer class of pervasive and sensor-based monitoring applications. For instance, many advanced applications generate interesting simple events as a result of stream processing that need to be further composed and detected in order to trigger appropriate actions. In this paper, we present EStream, an approach for integrating event and stream processing for monitoring changes on stream computations and for expressing and processing complex events on continuous queries (CQs). We introduce masks for reducing uninteresting events and for detecting events correctly and efficiently. We discuss stream modifiers, a special class of stream operators for computing changes over stream data. We also briefly discuss the architecture and functional modules of EStream.

Vihang Garg, Raman Adaikkalavan, Sharma Chakravarthy
Backmatter
Metadata
Title: Database and Expert Systems Applications
Editors: Stéphane Bressan, Josef Küng, Roland Wagner
Copyright year: 2006
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-540-37872-3
Print ISBN: 978-3-540-37871-6
DOI: https://doi.org/10.1007/11827405