
2005 | Book

Database and Expert Systems Applications

16th International Conference, DEXA 2005, Copenhagen, Denmark, August 22-26, 2005. Proceedings

Edited by: Kim Viborg Andersen, John Debenham, Roland Wagner

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

DEXA 2005, the 16th International Conference on Database and Expert Systems Applications, was held at the Copenhagen Business School, Copenhagen, Denmark, from August 22 to 26, 2005. The success of the DEXA series has partly been due to the way in which it has kept abreast of recent developments by spawning specialized workshops and conferences, each with its own proceedings. In 2005 the DEXA programme was co-located with the 7th International Conference on Data Warehousing and Knowledge Discovery [DaWaK 2005], the 6th International Conference on Electronic Commerce and Web Technologies [EC-Web 2005], the 4th International Conference on Electronic Government [EGOV 2005], the 2nd International Conference on Trust, Privacy, and Security in Digital Business [TrustBus 2005], the 2nd International Conference on Industrial Applications of Holonic and Multi-agent Systems [HoloMAS 2005], as well as 19 specialized workshops. These proceedings are the result of a considerable amount of hard work. After submission, the papers went through a reviewing process supported by online discussion among the reviewers to determine the final conference program. The authors of accepted papers revised their manuscripts to produce this fine collection. DEXA 2005 received 390 submissions, and from those the Program Committee selected the 92 papers in these proceedings. This year the reviewing process generated more than 1000 referee reports. The hard work of the authors, the referees and the Program Committee is gratefully acknowledged.

Table of Contents

Frontmatter
How to Design a Loose Inter-organizational Workflow? An Illustrative Case Study

This work deals with the design of Loose Inter-Organizational Workflow (IOW). Loose IOW refers to occasional cooperation, free of structural constraints, where the partners involved and their number are not predefined. We show that the design of loose IOW applications is very complex due to three factors: (i) the heterogeneity and distribution of the component processes, the organizations and the information; (ii) the autonomy of each partner, which must be preserved; (iii) the need to integrate in a coherent framework the three dimensions of a workflow: process, information and organization. One possible way to deal with this complexity, and to ease the design of loose IOW applications, is to use a well-known software engineering principle: the separation of aspects, which aims at decomposing a system into communicating subsystems, each one coping with a relevant abstraction that requires a model to be structured and described. Following this practice, a loose IOW application must be thought of as three communicating models: an informational model, an organizational model and a process model. The first two models are represented with UML class diagrams, while the last model is described with Petri Nets with Objects (PNO), a formal language with adequate expressive power that provides the glue between the three workflow dimensions. We illustrate our solution through the well-known “reviewing papers” case study.

Lotfi Bouzguenda
Recovering from Malicious Attacks in Workflow Systems

Workflow management systems (WFMS) coordinate the execution of logically related multiple tasks in an organization. Such coordination is achieved through dependencies that are specified between the tasks of a workflow. Oftentimes preventive measures are not enough and a workflow may be subjected to malicious attacks. Traditional workflow recovery mechanisms do not address how to recover from malicious attacks. Database survivability techniques do not work for workflows because tasks in a workflow have dependencies that are not present in traditional transaction processing systems. In this paper, we present an algorithm that shows how we can assess and repair the effects of damage caused by malicious tasks. Our algorithm focuses not only on restoring the consistency of data items by removing the effects of malicious tasks but also takes appropriate actions to ensure the satisfaction of task dependencies among all the committed tasks.

Yajie Zhu, Tai Xin, Indrakshi Ray
Towards Mining Structural Workflow Patterns

Collaborative information systems are becoming more and more complex, involving numerous interacting business objects within considerable processes. Analysing the interaction structure of those complex systems will enable them to be well understood and controlled. The work described in this paper is a contribution to these problems for workflow-based process applications. In fact, we discover workflow patterns from traces of workflow events based on a workflow mining technique. Workflow mining proposes techniques to acquire a workflow model from a workflow log. Mining of workflow patterns is done by a statistical analysis of log-based events. Our approach is characterised by a “local” workflow pattern discovery that allows partial results to be covered, and by a dynamic technique dealing with concurrency.

Walid Gaaloul, Karim Baïna, Claude Godart
Avoiding Error-Prone Reordering Optimization During Legal Systems Migration

During legal information systems migrations, one major problem is handling attribute value cleaning. We first show that many data cleaning steps operate on the values of the same data attributes and their derivations, that users may ignore or be puzzled by the data transforms that are defined to clean and transform the data sets, and that such a process can also be prone to error during process optimization. We then define two major such problems, the assignment conflict and the range conflict, and give problem definitions for these conflicts. Finally, we present two separate algorithms to discover and resolve the conflicts, respectively.

Youlin Fang, Heng Wang, Dongqing Yang
Automated SuperSQL Query Formulation Based on Statistical Characteristics of Data

In this paper, we propose an automated SuperSQL query formulating method based on statistical characteristics of data so that end-users can obtain the desired information more easily without burdening the web application development side. SuperSQL is an extension of SQL to generate various kinds of structured presentations and documents. In our method, first, web producers prepare a provisional SQL query and send it to a relational database. Next, the automated algorithm formulates a SuperSQL query based on statistical information about the query result. Finally, the resulting SuperSQL query is executed to return the categorized table to end-users. Experimental results demonstrate that the implemented system enables web producers to construct data-intensive web sites, which have better structure with respect to the recognition of the data by end-users.

Jun Nemoto, Motomichi Toyama
Distribution Rules for Array Database Queries

Non-trivial retrieval applications involve complex computations on large multi-dimensional datasets. These should, in principle, benefit from the use of relational database technology. However, expressing such problems in terms of relational queries is difficult and time-consuming. Even more discouraging is the efficiency issue: query optimization strategies successful in classical relational domains may not suffice when applied to the multi-dimensional array domain. The RAM (Relational Array Mapping) system hides these difficulties by providing a transparent mapping between the scientific problem specification and the underlying database system. In addition, its optimizer is specifically tuned to exploit the characteristics of the array paradigm and to allow for automatic balanced work-load distribution. Using an example taken from the multimedia domain, this paper shows how a distributed real-world application can be efficiently implemented, using the RAM system, without user intervention.

Alex van Ballegooij, Roberto Cornacchia, Arjen P. de Vries, Martin Kersten
Efficient Processing of Distributed Top-k Queries

Ranking-aware queries, or top-k queries, have received much attention recently in various contexts such as the web, multimedia retrieval, relational databases, and distributed systems. Top-k queries play a critical role in many decision-making related activities such as identifying interesting objects, network monitoring, and load balancing. In this paper, we study the ranking aggregation problem in distributed systems. Prior research addressing this problem did not take data distributions into account, simply assuming a uniform data distribution among nodes, which is not realistic for real data sets and is, in general, inefficient. In this paper, we propose three efficient algorithms that consider data distributions in different ways. Our extensive experiments demonstrate the advantages of our approaches in terms of bandwidth consumption.
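For readers unfamiliar with the setting, the sketch below shows one simple, exact way to aggregate top-k results from several nodes whose partial scores are summed: probe each node's local list in rounds and stop once the seen candidates beat a threshold. This is only an assumed baseline in Python, not any of the three distribution-aware algorithms proposed in the paper.

```python
# Hypothetical baseline for exact distributed top-k aggregation
# (sum of non-negative per-node partial scores).
def distributed_topk(node_scores, k):
    """node_scores: one dict {object_id: partial_score} per node, scores >= 0."""
    sorted_lists = [sorted(s.items(), key=lambda kv: -kv[1]) for s in node_scores]
    depth = k
    while True:
        # Round 1: look at each node's current top-`depth` entries.
        seen = set()
        for lst in sorted_lists:
            seen.update(oid for oid, _ in lst[:depth])
        # Any object never seen so far scores at most `threshold` in total.
        threshold = sum(lst[min(depth, len(lst)) - 1][1] if lst else 0.0
                        for lst in sorted_lists)
        # Round 2: fetch exact totals for the candidates seen so far.
        totals = {oid: sum(s.get(oid, 0.0) for s in node_scores) for oid in seen}
        top = sorted(totals.items(), key=lambda kv: -kv[1])[:k]
        exhausted = all(depth >= len(lst) for lst in sorted_lists)
        if exhausted or (len(top) == k and top[-1][1] >= threshold):
            return top
        depth *= 2  # probe deeper and repeat

print(distributed_topk([{"a": 3, "b": 1}, {"a": 1, "c": 5}, {"b": 2, "c": 1}], k=2))
```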

Hailing Yu, Hua-Gang Li, Ping Wu, Divyakant Agrawal, Amr El Abbadi
Evaluating Mid-(k, n) Queries Using B+-Tree

Traditional database systems assume that clients always consume the results of queries from the beginning. In various new applications, especially on the WWW, however, clients frequently need a small part of the result from the middle, e.g. retrieving a page in a bulletin board on the WWW. To process this partial retrieval, traditional database systems have to find all the records and discard unnecessary ones. Although several algorithms for top-k queries have been proposed, there has been no research effort on partial retrieval from the middle of an ordered result. In this paper, we define a mid-(k, n) query, which retrieves n records starting from the k-th record of an ordered result. We also propose an efficient algorithm for mid-(k, n) queries using a slightly modified B+-Tree, named the B+c-Tree. We provide theoretical analysis and experimental results showing that the proposed technique evaluates mid-(k, n) queries efficiently.
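To make the query semantics concrete, here is a small, hedged sketch of the underlying idea: if each subtree's record count is available at its parent (as in a count-augmented tree), the k-th record in key order can be located by descending and subtracting counts, after which n consecutive records are returned. The class and function names are illustrative only; this is not the B+c-Tree itself.

```python
# Toy count-augmented tree; a real B+c-Tree would keep the counts
# materialized in its internal nodes instead of recomputing them.
class Node:
    def __init__(self, children=None, records=None):
        self.children = children or []   # internal node: list of child Nodes
        self.records = records or []     # leaf node: records in key order

    def count(self):
        if not self.children:
            return len(self.records)
        return sum(c.count() for c in self.children)

def mid_k_n(node, k, n, out=None):
    """Collect n records starting at the k-th (1-based) record under `node`."""
    out = [] if out is None else out
    if len(out) >= n:
        return out
    if not node.children:                          # leaf
        out.extend(node.records[k - 1:k - 1 + (n - len(out))])
        return out
    for child in node.children:
        c = child.count()
        if k > c:
            k -= c                                 # skip this whole subtree
        else:
            mid_k_n(child, k, n, out)
            k = 1                                  # later subtrees start at their 1st record
        if len(out) >= n:
            break
    return out

tree = Node(children=[Node(records=[10, 20, 30]),
                      Node(records=[40, 50]),
                      Node(records=[60, 70, 80])])
print(mid_k_n(tree, k=4, n=3))   # -> [40, 50, 60]
```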

Dongseop Kwon, Taewon Lee, Sukho Lee
On Effective E-mail Classification via Neural Networks

To address the growing problem of junk E-mail on the Internet, this paper proposes an effective E-mail classifying and cleansing method. E-mail messages can be modelled as semi-structured documents consisting of a set of fields with pre-defined semantics and a number of variable-length free-text fields. Our proposed method deals with both fields having pre-defined semantics and variable-length free-text fields to obtain higher accuracy. The main contributions of this work are two-fold. First, we present a new model based on the Neural Network (NN) for classifying personal E-mails. In particular, we treat E-mail files as a particular kind of plain text files, the implication being that our feature set is relatively large (since there are thousands of different terms in different E-mail files). Second, we propose the use of Principal Component Analysis (PCA) as a preprocessor of the NN to reduce the data in terms of both size and dimensionality, so that the input data become more classifiable and the training process used in the NN model converges faster. The results of our performance evaluation demonstrate that the proposed algorithm is indeed effective in performing filtering with reasonable accuracy.
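The recipe the abstract outlines (term features, dimensionality reduction, then a neural classifier) can be approximated with off-the-shelf components. The sketch below uses scikit-learn, with TruncatedSVD standing in for PCA because it operates directly on sparse term matrices; the toy messages and parameters are made up, and the authors' own NN/PCA implementation is not reproduced.

```python
# Hedged sketch: terms -> dimensionality reduction -> neural classifier.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neural_network import MLPClassifier

mails = ["cheap meds buy now", "meeting agenda for monday",
         "win a free prize", "project status report"]
labels = ["spam", "ham", "spam", "ham"]

clf = make_pipeline(
    TfidfVectorizer(),                 # thousands of term features in a real corpus
    TruncatedSVD(n_components=2),      # reduce dimensionality before the NN
    MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0),
)
clf.fit(mails, labels)
print(clf.predict(["free meds prize now"]))
```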

Bin Cui, Anirban Mondal, Jialie Shen, Gao Cong, Kian-Lee Tan
An Adaptive Spreading Activation Scheme for Performing More Effective Collaborative Recommendation

While Spreading Activation has shown its effectiveness in solving the problems of cold start and sparsity in collaborative recommendation, it suffers a decay of performance (over-activation) as the dataset grows denser. In this paper, we first introduce the concepts of Rating Similarity Matrix (RSM) and Rating Similarity Aggregation (RSA), based on which we then extend the existing spreading activation scheme to deal with both binary (transaction) and numeric ratings. After that, an iterative algorithm is proposed to learn RSM parameters from the observed ratings, which makes it automatically adaptive to the user similarity shown through their ratings on different items. Thus the similarity calculations tend to be more reasonable and effective. Finally, we test our method on the EachMovie dataset, the most typical benchmark for collaborative recommendation, and show that our method succeeds in relieving the effect of over-activation and outperforms the existing algorithms on both sparse and dense datasets.
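For readers new to the technique, the following is a minimal sketch of plain spreading activation over a binary user-item rating graph; the paper's Rating Similarity Matrix, similarity aggregation and parameter learning are not shown, and the decay factor is an arbitrary assumption.

```python
# Plain spreading activation over a bipartite user-item graph (binary ratings).
def spread_recommend(ratings, user, steps=2, decay=0.5):
    """ratings: dict user -> set of rated items. Returns (item, activation) pairs."""
    user_act = {user: 1.0}
    item_act = {}
    for _ in range(steps):
        # Spread activation from active users to the items they rated.
        new_items = {}
        for u, a in user_act.items():
            for it in ratings.get(u, ()):
                new_items[it] = new_items.get(it, 0.0) + a * decay
        for it, a in new_items.items():
            item_act[it] = item_act.get(it, 0.0) + a
        # Spread back from those items to the users who rated them.
        new_users = {}
        for u, items in ratings.items():
            for it in items:
                if it in new_items:
                    new_users[u] = new_users.get(u, 0.0) + new_items[it] * decay
        user_act = new_users
    seen = ratings.get(user, set())
    return sorted(((it, a) for it, a in item_act.items() if it not in seen),
                  key=lambda kv: -kv[1])

ratings = {"u1": {"a", "b"}, "u2": {"b", "c"}, "u3": {"c", "d"}}
print(spread_recommend(ratings, "u1"))   # recommends "c" for u1
```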

Peng Han, Bo Xie, Fan Yang, Rui-Min Shen
Feature Selection by Ordered Rough Set Based Feature Weighting

The aim of feature subset selection is to reduce the complexity of an induction system by eliminating irrelevant and redundant features. Selecting the right set of features for a classification task is one of the most important problems in designing a good classifier. In this paper we propose a feature selection approach based on rough set based feature weighting. In the approach the features are weighted and ranked in descending order. An incremental forward interleaved selection process is used to determine the best feature set with the highest possible classification accuracy. The approach is evaluated and tested using standard datasets. The experiments were carried out to evaluate the influence of the feature pre-selection on the prediction accuracy of the rough classifier. The results showed that the accuracy could be improved with an appropriate feature pre-selection phase.

Qasem A. Al-Radaideh, Md Nasir Sulaiman, Mohd Hasan Selamat, Hamidah Ibrahim
A Full-Text Framework for the Image Retrieval Signal/Semantic Integration

This paper presents an approach for integrating perceptual signal features (i.e. color and texture) and semantic information within a coupled architecture for image indexing and retrieval. It relies on an expressive knowledge representation formalism handling high-level image descriptions and a full-text query framework. It consequently brings the level of image retrieval closer to users’ needs by translating low-level signal features to high-level conceptual data and integrating them with semantic characterization within index and query structures. Experiments on a corpus of 2500 photographs validate our approach by considering recall-precision indicators over a set of 46 full-text queries coupling high-level semantic and signal features.

Mohammed Belkhatir, Philippe Mulhem, Yves Chiaramella
A New Algorithm for Content-Based Region Query in Multimedia Databases

This article presents an original method of implementation of the color set back-projection algorithm. This is one of the most efficient methods of automated detection of color regions from an image. The detected regions are then used in content-based region queries. The query is realized on one or more regions, taking into consideration the color feature, location, area and the minimal bounding rectangle of the regions. The efficiency of the method was studied by means of a number of experiments carried out, with the help of a software system developed for this purpose, on a collection of synthetic images. This method of implementation influences the efficiency of content-based region queries in a series of domains in which this technique is used, such as medicine, art, and education.

Dumitru Dan Burdescu, Liana Stanescu
SM3+: An XML Database Solution for the Management of MPEG-7 Descriptions

MPEG-7 is a promising standard for the description of multimedia content. A lot of applications based on MPEG-7 media descriptions have been set up. Therefore, an efficient storage solution for large amounts of MPEG-7 descriptions is certainly desirable. MPEG-7 documents are also data-centric XML documents. Due to many advantages, the relational DBMS is the best choice for storing such XML documents. However, the existing RDBMS-based XML storage solutions cannot meet all the critical requirements for MPEG-7 description management. In this paper, we analyse the problems when using existing RDBMS-based XML storage approaches to store MPEG-7 documents and then present a new storage approach, called SM3+, that integrates the advantages of existing XML storage models and avoids their main drawbacks. Its features can meet the most critical requirements for MPEG-7 document storage and management. Performance studies are conducted and the experimental results are encouraging.

Yang Chu, Liang-Tien Chia, Sourav S. Bhowmick
LocalRank: Ranking Web Pages Considering Geographical Locality by Integrating Web and Databases

In this paper, we propose a method called LocalRank to rank web pages by integrating the web and a user database containing information on a specific geographical area. LocalRank is a rank value for a web page that assesses its relevance to database entries, considering geographical locality and its popularity in a local web space. In our method, we first construct a linked graph structure using entries contained in the database. The nodes of this graph consist of database entries and their related web pages. The edges in the graph are composed of semantic links, including geographical links, between these nodes, in addition to conventional hyperlinks. Then a link analysis is performed to compute a LocalRank value for each node. LocalRank can represent the user’s interest since this graph effectively integrates the web and the user database. Our experimental results for a local restaurant database show that local web pages related to the database entries are highly ranked based on our method.
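As a rough illustration of the link-analysis step (not the LocalRank formulation itself, which additionally weights geographical links), the sketch below builds a small directed graph of database entries and related pages and scores the nodes with NetworkX's standard PageRank; all node names are invented.

```python
# Hedged sketch: link analysis over a mixed graph of database entries,
# web pages, hyperlinks and assumed "geographical" links.
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([
    ("restaurant_db_entry", "restaurant_homepage"),   # entry -> its page
    ("restaurant_homepage", "local_review_page"),     # hyperlink
    ("local_review_page", "restaurant_db_entry"),     # assumed geographical link
    ("city_portal", "restaurant_homepage"),           # hyperlink
])
print(nx.pagerank(g, alpha=0.85))    # a score per node, higher = more "locally popular"
```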

Jianwei Zhang, Yoshiharu Ishikawa, Sayumi Kurokawa, Hiroyuki Kitagawa
My Portal Viewer: Integration System Based on User Preferences for News Web Sites

We developed a novel web application called “My Portal Viewer (MPV)”, which automatically categorizes and integrates meta-data from many news pages based on the user’s preferences after gathering these news pages from various news sites. Our unique approach is based on two points: one is an automatic categorization of collected information based on user’s interests and knowledge, and the other is the look and feel of the MPV page, which is applied to the user’s favorite news portal page, and part of the original content is replaced by the integrated content. Whenever a user accesses the MPV page after browsing news pages, he/she can obtain the desired content efficiently because the MPV presents pages refreshed based on the user’s behavior through his/her favorite page layout, which reflects his/her interests and knowledge. In this paper, we describe the MPV framework, and methods that are based on the user’s preferences for replacing and categorizing content have been developed using an HTML table model and a vector matching model.

Yukiko Kawai, Daisuke Kanjo, Katsumi Tanaka
Web Query Expansion by WordNet

In this paper, we address a novel method of Web query expansion using WordNet and a TSN. WordNet is an online lexical dictionary which describes word relationships along three dimensions: hypernyms, hyponyms and synonyms. Their impacts on expansion are different, and we provide quantitative descriptions of the query expansion impact along each dimension. However, WordNet may introduce noise into the expansion due to its collection-independent characteristics. Furthermore, it may not capture the current state of words and their relationships because of the explosive growth of the Web. To overcome these problems, a collection-based TSN (Term Semantic Network) is created with respect to word co-occurrence in the collection. We use the TSN both as a filter and as a supplement for WordNet. We also provide a quantitative study of the best way to perform the expansion with the TSN. In our system, we combine the query expansions along each semantic dimension as our overall solution. Our experiments reveal that the combined expansion can provide satisfactory results for Web query performance. The methodologies in this paper have already been employed in our Web image search engine system.
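A minimal sketch of expansion along the three WordNet dimensions the paper names, using NLTK's WordNet interface; the TSN-based filtering and per-dimension weighting are omitted, and the example assumes the NLTK WordNet corpus is installed.

```python
# Gather synonym, hypernym and hyponym terms for a query word via WordNet.
from nltk.corpus import wordnet as wn

def expand(term):
    syns, hypers, hypos = set(), set(), set()
    for s in wn.synsets(term):
        syns.update(l.name() for l in s.lemmas())
        hypers.update(l.name() for h in s.hypernyms() for l in h.lemmas())
        hypos.update(l.name() for h in s.hyponyms() for l in h.lemmas())
    return {"synonym": syns - {term}, "hypernym": hypers, "hyponym": hypos}

print(expand("car"))   # candidate expansion terms, before any TSN-style filtering
```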

Zhiguo Gong, Chan Wa Cheang, U Leong Hou
Webified Video: Media Conversion from TV Programs to Web Content for Cross-Media Information Integration

A method is proposed for viewing broadcast content that converts TV programs into Web content and integrates the results with related information, enabling one to flexibly browse the original content with value-added content and from various viewpoints. Even though the amount of information available via TV programs and the Web is increasing constantly, humans have a definite ceiling on the spare time they can devote to each medium. To efficiently acquire information, they need a mechanism that can smoothly go back and forth across different media depending on situations and necessity. Our proposed method is a tool to achieve such cross-media information integration. Media conversion from TV programs to Web content enhances the browsability of the former, enabling one to skim a program’s outline or to efficiently search for favorite scenes. Integration with related information enriches viewing of TV programs in different ways, such as value-added content and content based on particular viewpoints. Preliminary testing of a prototype system for next-generation storage TV validated the approach taken by the proposed method.

Hisashi Miyamori, Katsumi Tanaka
A Caching Model for Real-Time Databases in Mobile Ad-Hoc Networks

Although caching has been shown to be an efficient technique to improve the performance of database systems, it also introduces overhead and complexity in maintaining data consistency between the primary copies on servers and the cached copies on clients. Little research has been performed on data caching in the mobile ad-hoc network (MANET) environment, where both servers and clients are nomadic. In this paper, a caching model called GMANET is designed to maintain both strong and weak cache consistency for distributed real-time database transaction systems in group-based MANETs and, at the same time, to incur as few update control messages as possible. GMANET is compared with existing caching models by means of simulation. The experimental results show that GMANET has the best performance in terms of the percentage of transactions processed before their deadlines and is comparable to the other caching models in terms of mobile hosts’ energy consumption.

Yanhong Li, Le Gruenwald
Adaptive Query Processing in Point-Transformation Schemes

Point-transformation schemes transform multidimensional points to one-dimensional values so that conventional access methods can be used to index multidimensional points, enabling convenient integration of multidimensional query facilities into complex transactional database management systems. However, in high-dimensional data spaces, the computational complexity of the required query transformation is prohibitive. This paper proposes a near-optimal solution based on our novel ideas of adaptive Z-ordering. The experimental results show that the proposed approach significantly improves the query performance of point-transformation schemes. The idea can easily be generalized to accommodate any hierarchical space-filling curve, not just the Z-curve.
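For context, the basic Z-order (Morton) transformation that such point-transformation schemes build on can be written in a few lines: interleave the coordinate bits so a multidimensional point maps to a single integer key that a conventional one-dimensional index can store. The adaptive variant proposed in the paper is not shown here.

```python
# Basic Z-order (Morton) encoding by bit interleaving.
def z_value(point, bits=8):
    """point: tuple of non-negative ints, each < 2**bits."""
    code = 0
    for bit in range(bits):                  # from least significant bit upward
        for dim, coord in enumerate(point):
            code |= ((coord >> bit) & 1) << (bit * len(point) + dim)
    return code

print(z_value((3, 5)))   # -> 39: one sortable key for the 2-D point (3, 5)
```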

Byunggu Yu
On the General Signature Trees

The signature file method is a popular indexing technique used in information retrieval and databases. It excels in efficient index maintenance and low space overhead. Different approaches for organizing signature files have been proposed, such as sequential signature files, bit-slice files, S-trees and their different variants, as well as signature trees. In this paper, we extend the structure of signature trees by introducing multiple-bit checking. That is, during the search of a signature tree against a query signature s_q, more than one bit in s_q is checked each time a node is encountered. This not only significantly reduces the size of a signature tree, but also increases its filtering ability. We call such a structure a general signature tree. Experiments have been made, showing that the general signature tree uniformly outperforms the signature tree approach.
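As a refresher on the underlying signature-file idea (not the general signature tree itself), the sketch below builds superimposed-coding signatures and uses them as an inexact filter: a record can match a query only if every bit set in the query signature is also set in the record signature. The hash choice and signature width are arbitrary assumptions.

```python
# Superimposed-coding signatures used as an inexact pre-filter.
def signature(words, width=64, hashes=2):
    sig = 0
    for w in words:
        for i in range(hashes):
            sig |= 1 << (hash((w, i)) % width)   # set `hashes` bits per word
    return sig

def may_match(record_sig, query_sig):
    # Filter only: false positives are possible, false negatives are not.
    return record_sig & query_sig == query_sig

docs = {"d1": {"database", "index"}, "d2": {"workflow", "mining"}}
sigs = {d: signature(ws) for d, ws in docs.items()}
q = signature({"index"})
print([d for d, s in sigs.items() if may_match(s, q)])
```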

Yangjun Chen
Optimizing I/O Costs of Multi-dimensional Queries Using Bitmap Indices

Bitmap indices are efficient data structures for processing complex, multi-dimensional queries in data warehouse applications and scientific data analysis. For high-cardinality attributes, a common approach is to build bitmap indices with binning. This technique partitions the attribute values into a number of ranges, called bins, and uses bitmap vectors to represent bins (attribute ranges) rather than distinct values. In order to yield exact query answers, parts of the original data values have to be read from disk for checking against the query constraint. This process is referred to as the candidate check and usually dominates the total query processing time.

In this paper we study several strategies for optimizing the candidate check cost for multi-dimensional queries. We present an efficient candidate check algorithm based on attribute value distribution, query distribution as well as query selectivity with respect to each dimension. We also show that re-ordering the dimensions during query evaluation can be used to reduce I/O costs. We tested our algorithm on data with various attribute value distributions and query distributions. Our approach shows a significant improvement over traditional binning strategies for bitmap indices.
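The sketch below illustrates the plain binning-plus-candidate-check evaluation that such work starts from, with assumed equi-width bins; the paper's optimized candidate-check strategies and dimension re-ordering are not reproduced. Bins fully covered by the query range are answered from the bitmaps alone; only rows in partially covered edge bins trigger a check against the base data.

```python
# Binned bitmap index for one attribute, with a candidate check on edge bins.
def build_binned_bitmaps(values, edges):
    """edges: sorted bin boundaries; bin i covers [edges[i], edges[i+1])."""
    bitmaps = [[0] * len(values) for _ in range(len(edges) - 1)]
    for row, v in enumerate(values):
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                bitmaps[i][row] = 1
                break
    return bitmaps

def range_query(values, edges, bitmaps, lo, hi):
    """Return row ids with lo <= value < hi."""
    hits = set()
    for i in range(len(edges) - 1):
        b_lo, b_hi = edges[i], edges[i + 1]
        if b_hi <= lo or b_lo >= hi:
            continue                                   # bin disjoint from the query
        rows = [r for r, bit in enumerate(bitmaps[i]) if bit]
        if lo <= b_lo and b_hi <= hi:
            hits.update(rows)                          # fully covered: bitmap alone suffices
        else:
            hits.update(r for r in rows if lo <= values[r] < hi)   # candidate check
    return sorted(hits)

vals = [3, 17, 8, 25, 12, 6]
edges = [0, 10, 20, 30]
bms = build_binned_bitmaps(vals, edges)
print(range_query(vals, edges, bms, lo=5, hi=20))   # -> [1, 2, 4, 5]
```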

Doron Rotem, Kurt Stockinger, Kesheng Wu
Environmental Noise Classification for Multimedia Libraries

In the modern information society, multimedia libraries are increasingly essential core components of the information systems managing our digital assets. The effective and efficient management of large amounts of multimedia information involves the extraction of relevant features from unstructured multimedia documents, images, videos, and sound recordings, as well as the organization, classification, and retrieval of these multimedia documents. A particularly important aspect is the opportunity to combine a variety of diverse features. In this paper we are interested in a feature rarely considered in such systems: the environmental noise. We design, implement, present, and evaluate an experimental multimedia library system for video clips and sound recordings in which scenes are indexed, classified and retrieved according to their environmental noise. Namely, after adequate training, the system is able to distinguish between, for instance, traffic scenes, canteen scenes, and gunfight scenes. We show how we improved existing techniques for the classification of sound to reach an accuracy of up to 90% in the recognition of environmental noise.

Stéphane Bressan, Boon Tiang Tan
Quality-Aware Replication of Multimedia Data

In contrast to alpha-numerical data, multimedia data can have a wide range of quality parameters such as spatial and temporal resolution, and compression format. Users can request data with a specific quality requirement due to the needs of their applications, or the limitations of their resources. On-the-fly conversion of multimedia data (such as video transcoding) is very CPU intensive and can limit the level of concurrent access supported by the database. Storing all possible replicas, on the other hand, requires unacceptable increases in storage requirements. Although replication has been well studied, to the best of our knowledge, the problem of multiple-quality replication has not been addressed. In this paper we address the problem of multiple-quality replica selection subject to an overall storage constraint.

We establish that the problem is NP-hard and provide heuristic solutions under a soft quality system model where users are willing to negotiate their quality needs. An important optimization goal under such a model is to minimize utility loss. We propose a powerful greedy algorithm to solve this optimization problem. Extensive simulations show that our algorithm finds near-optimal solutions. The algorithm is flexible in that it can be extended to deal with replica selection for multiple media objects and changes of query pattern. We also discuss an extended version of the algorithm with potentially better performance.

Yi-Cheng Tu, Jingfeng Yan, Sunil Prabhakar
Rotation and Gray-Scale Invariant Classification of Textures Improved by Spatial Distribution of Features

In this paper, we present a framework for texture descriptors based on spatial distribution of textural features. Our approach is based on the observation that regional properties of textures are well captured by correlations among local texture patterns. The proposed method has been evaluated through experiments using real textures, and has shown significant improvements in recognition rates.

Gouchol Pok, Keun Ho Ryu, Jyh-charn Lyu
Zooming Cross-Media: A Zooming Description Language Coding LOD Control and Media Transition

We propose a “Zooming Cross-Media” concept that uses zooming to achieve both changes in the level of detail and transitions between media, for contents containing varied media. Examples are text, images, video, and sound. As part of the concept, we propose a zooming description language (ZDL) based on XML. Unlike existing zooming interfaces, ZDL codes the zooming operation and behavior on the content side. Because ZDL adopts XML coding, we can locate “zooming” as the third interface in the Web document environment after “scrolling” and “anchor clicking.” The zooming operation and behavior is independently coded from the content structure in ZDL. With ZDL, it is possible to (1) control the zooming of each “zoom object” making up the contents, (2) control the degree of zooming by introducing a “zoom rate” parameter, and (3) relate objects mutually and specify zooming propagation between related objects.

Tadashi Araki, Hisashi Miyamori, Mitsuru Minakuchi, Ai Kato, Zoran Stejic, Yasushi Ogawa, Katsumi Tanaka
A Histogram-Based Selectivity Estimator for Skewed XML Data

The optimization of XML queries requires an accurate and compact structure to capture the characteristics of the underlying data. A compact structure works well when the data is uniformly distributed and has many common paths. However, more detailed information needs to be maintained when the data is skewed. This work presents a histogram-based structure to capture the distribution of skewed XML data. It builds upon a statistical method to estimate the result size of XML queries. Experimental results indicate that the proposed method leads to a more accurate estimation.

Hanyu Li, Mong Li Lee, Wynne Hsu
Accelerating XML Structural Join by Partitioning

Structural join is a core operation in XML queries and has a significant impact on their performance. Several classical structural join algorithms have been proposed, such as the Stack-tree join and the XR-Tree join. In this paper, we consider answering the structural join problem by partitioning. We first extend the relationships between nodes to relationships between partitions in the plane and derive some observations. We then propose a new partition-based method, P-Join, for structural join. Based on P-Join, moreover, we present an enhanced partition-based spatial structural join algorithm, PSSJ.

Nan Tang, Jeffrey Xu Yu, Kam-Fai Wong, Kevin Lü, Jianxin Li
Efficient Dissemination of Filtered Data in XML-Based SDI

With the increasing popularity of XML, there is a growing number of reports on the XML-based Selective Dissemination of Information (SDI) in the literature, in which many XML filtering systems have been proposed to support the efficient XML-based SDI. While previous research into XML filtering focused on efficient filtering algorithms, in this paper, we consider a novel XML dissemination problem, focusing on the efficient dissemination of filtered data. To this end, we describe a significant problem associated with the dissemination of XML documents and propose effective algorithms which allow for its resolution. Based on these effective algorithms, we developed an XML-based SDI system that can disseminate XML documents efficiently. The experimental results demonstrate that the proposed algorithms outperform earlier XML filtering systems in terms of the usage of network bandwidth.

Jae-Ho Choi, Young-Jin Yoon, SangKeun Lee
Efficient Processing of Ordered XML Twig Pattern

Finding all the occurrences of a twig pattern in an XML database is a core operation for the efficient evaluation of XML queries. The holistic twig join algorithm has shown its superiority over binary-decomposition-based approaches due to its efficient reduction of intermediate results. The existing holistic join algorithms, however, cannot deal with ordered twig queries. A straightforward approach that first matches the unordered twig query and then prunes away the undesired answers is obviously not optimal in most cases. In this paper, we study a novel holistic-processing algorithm, called OrderedTJ, for ordered twig queries. We show that OrderedTJ can identify a large query class for which I/O optimality is guaranteed. Finally, our experiments show the effectiveness, scalability and efficiency of our proposed algorithm.

Jiaheng Lu, Tok Wang Ling, Tian Yu, Changqing Li, Wei Ni
A Flexible Role-Based Delegation Model Using Characteristics of Permissions

Role-Based Access Control (RBAC) has recently received considerable attention as a promising alternative to traditional discretionary and mandatory access controls [7]. RBAC ensures that only authorized users are given access to protected data or resources. A successful marriage of Web and RBAC technology can support effective security in large-scale enterprise-wide systems with various organization structures. Most large organizations have business rules related to access control policy, and delegation of authority is an important one of these rules [1]. RBDM0, RDM2000 and PBDM are recently published models for role-based delegation. RBDM0 and RDM2000 deal with user-to-user delegation and total delegation. PBDM supports user-to-user and role-to-role delegation and also supports both role- and permission-level delegation, which provides great flexibility in authority management. However, PBDM does not support constraints in RBAC delegation models, such as separation of duty in user-to-user and role-to-role delegation. This paper proposes a new delegation model using characteristics of permissions, in which a security administrator can easily perform partial delegation, permission-level delegation and restricted inheritance. It supports flexible delegation by dividing a role into sub-roles according to the characteristics of the permissions assigned to the role and by considering delegation and inheritance simultaneously. It provides flexibility in authority management, such as multi-step delegation, multi-option revocation and controlled inheritance, by including characteristics of PBDM and the sub-role hierarchy concept. It also supports constraints such as separation of duty based on permissions in user-to-user and role-to-role delegation.

Dong-Gue Park, You-Ri Lee
Provable Data Privacy

In relational database systems a combination of privileges and views is employed to limit a user’s access and to hide non-public data. The data privacy problem is to decide whether the views leak information about the underlying database instance. Or, to put it more formally, the question is whether there are certain answers of a database query with respect to the given view instance. In order to answer the problem of provable data privacy, we will make use of query answering techniques for data exchange. We also investigate the impact of database dependencies on the privacy problem. An example about health care statistics in Switzerland shows that we also have to consider dependencies which are inherent in the semantics of the data.

Kilian Stoffel, Thomas Studer
Formalizing the XML Schema Matching Problem as a Constraint Optimization Problem

The first step in finding an efficient way to solve any difficult problem is making a complete, possibly formal, problem specification. This paper introduces a formal specification for the problem of semantic XML schema matching. Semantic schema matching has been extensively researched, and many matching systems have been developed. However, formal specifications of problems being solved by these systems do not exist, or are partial. In this paper, we analyze the problem of semantic schema matching, identify its main components and deliver a formal specification based on the constraint optimization problem formalism. Throughout the paper, we consider the schema matching problem as encountered in the context of a large scale XML schema matching application.

Marko Smiljanić, Maurice van Keulen, Willem Jonker
Evolving XML Schemas and Documents Using UML Class Diagrams

The widespread use of XML brings new challenges for its integration into general software development processes. In particular, it is necessary to keep the consistency between different software artifacts and XML documents when evolution tasks are carried out. In this paper we present an approach to evolve XML schemas and documents conceptually modeled by means of UML class diagrams. Evolution primitives are issued on the UML class diagram and are automatically propagated down to the XML schema. The XML documents are also automatically modified to conform to the new XML schema. In this way, the consistency between the different artifacts involved is kept. This goal is achieved by using an intermediate component which reflects how the UML diagrams are translated into the XML schemas.

Eladio Domínguez, Jorge Lloret, Ángel L. Rubio, María A. Zapata
Building XML Documents and Schemas to Support Object Data Exchange and Communication

The exchange of data between heterogeneous database systems is becoming a key issue in several application domains. XML is emerging as the standard means for data exchange over the web. For object-oriented databases storing complex information, it is important to be able to exchange both objects and object schemas. In this paper we propose two approaches for translating database objects into XML documents which are both human readable and suitable for system-based queries, thus preserving object semantics. Both approaches have been validated by a running prototype.

Carlo Combi, Giuseppe Pozzi
Intensional Encapsulations of Database Subsets via Genetic Programming

Finding intensional encapsulations of database subsets is the inverse of query evaluation. Whereas query evaluation transforms an intensional expression (the query) to its extension (a set of data values), intensional encapsulation assigns an intensional expression to a given set of data values. We describe a method for deriving intensional representations of subsets of records in large database tables. Our method is based on the paradigm of genetic programming. It is shown to achieve high accuracy and maintain compact expression size, while requiring cost that is acceptable to all applications except those that require instantaneous results. Intensional encapsulation has a broad range of applications including cooperative answering, information integration, security and data mining.

Aybar C. Acar, Amihai Motro
Preferred Skyline: A Hybrid Approach Between SQLf and Skyline

The World Wide Web (WWW) is a great repository of data and may reference thousands of data sources for almost any knowledge domain. Users frequently access sources to query information and may be interested only in the top k answers that meet their preferences. Although many declarative languages have been defined to express WWW queries, the problem of specifying user preferences and considering this information during query optimization and evaluation remains open. The most widely used query languages, such as SQL and XQuery, do not allow specifying general conditions on user preferences. For example, using the SQL ORDER BY clause one can express the order in which the answers will appear, but the user must take care of filtering the tuples that satisfy their preferences. Skyline and SQLf are two extensions of SQL defined to express general user preferences. Skyline offers physical operators to construct a Pareto curve of the non-dominated answers. Skyline may return answers with bad values for some criteria and does not discriminate between the non-dominated answers. On the other hand, SQLf gives the best answers in terms of user preferences, but it may return dominated answers. Finally, the skyline operator’s evaluation time is higher than that of SQLf. We propose a hybrid approach, Preferred Skyline, integrating Skyline and SQLf to produce only answers on the Pareto curve with the best satisfaction degrees. We report initial experimental results on execution time and answer precision. They show that Preferred Skyline consumes less time than Skyline and its precision is higher than SQLf’s.
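A hedged sketch of the combination described here: keep the Pareto-non-dominated tuples (the skyline) and order them by an SQLf-style satisfaction degree. The criteria, data and membership function (min as a fuzzy conjunction) are invented for illustration and do not reproduce the paper's operators.

```python
# Skyline (non-dominated tuples) ordered by a fuzzy satisfaction degree.
def dominates(a, b):
    """a dominates b if a is >= on every criterion and > on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def preferred_skyline(tuples, satisfaction):
    skyline = [t for t in tuples
               if not any(dominates(o, t) for o in tuples if o is not t)]
    return sorted(skyline, key=satisfaction, reverse=True)

# Example: (price_score, quality_score), higher is better on both criteria.
rows = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4)]
print(preferred_skyline(rows, satisfaction=lambda t: min(t)))  # min acts as fuzzy "and"
```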

Marlene Goncalves, María-Esther Vidal
Resolution of Semantic Queries on a Set of Web Services

In this article, we present an approach to executing semantic queries on a set of domain-related Web services. The aim of this architecture is to obtain information from the available Web services, by mapping inputs and outputs of the Web service methods into concepts of an ontology defined in OWL (Ontology Web Language). For this purpose, we extend OWL-S to assign concepts and their contexts to the inputs and outputs of the methods, so that each method parameter is annotated without ambiguity. The paper also presents a graph-based data structure in which all the information needed for the semantic composition of annotated Web services is stated. Finally, we show how this data structure is used to solve semantic queries over the Web services.

Jordi Paraire, Rafael Berlanga, Dolores M. Llidó
Detecting Semantically Correct Changes to Relevant Unordered Hidden Web Data

Current proposals for XML change detection use structural constraints to detect changes and ignore semantic constraints. Consequently, they may produce semantically incorrect changes. In this paper, we argue that the semantics of data is important for change detection. We present a semantic-conscious change detection technique for hidden web data. In our approach we transform the unordered hidden web query results to XML format and then detect the changes between two versions of the XML representation of the hidden web data by extending X-Diff, a published unordered XML change detection algorithm. By taking advantage of the semantics, we experimentally demonstrate that our change detection approach runs up to 7 times faster than X-Diff on real-life hidden web data and always detects changes that are semantically more correct than those detected by existing proposals.

Vladimir Kovalev, Sourav S. Bhowmick
Design for All in Information Technology: A Universal Concern

The concept of Design for All is not well understood, and the issues of accessibility and inclusion are often relegated to specialists and dedicated conferences. This paper introduces the concept of Design for All, dispelling some of the misunderstandings that surround it, and situating it within the Information Technology context, as distinct from wider considerations such as accessibility in the built environment. Some of the reasons for undertaking Design for All are discussed, and, making use of the analogy of the printed book, the paper then shows how Design for All in combination with Information Technologies are enablers in the widest sense of the term. Finally, it is noted that Design for All is a process, not a product, and while there are people who specialise in eAccessibility, the research agenda demands more involvement from information technologists of all kinds. These are issues that concern us all, in our roles as designers and implementers of information technology, as well as in our role as consumers of information and participants in the Information Society.

Jenny Darzentas, Klaus Miesenberger
An Efficient Scheme of Update Robust XML Numbering with XML to Relational Mapping

Despite the emergence of XML as a standard for data exchange on the Web, XML update processing has received little attention. One of the issues that needs to be addressed in XML update processing is the update robustness of the XML numbering scheme employed for efficient containment query processing. In this paper, we propose an efficient update-robust XML numbering scheme. It works for both ordered and unordered XML data and fits well in XML to relational mapping. The proposed scheme was implemented with a relational database as the XML store. The experimental results showed that our scheme is more efficient than the previous one.

Hyunchul Kang, Young-Hyun Kim
On Maintaining XML Linking Integrity During Update

It is a fact that XML update has become more important with the rise of XML database usage. How update operations affect XML documents needs to be investigated further. In this paper we propose a methodology to accommodate updates without violating the XML document’s constraints. The constraints maintained are those that are defined using the XML linking languages xlink and xpointer. These languages, standardized by the W3C, are used for referencing among XML documents or nodes.

Since an XML link is embedded as an attribute in an XML instance, our proposal can be used for schema-less documents and for instance-based references. We propose a set of functions that perform checking mechanisms before updates. The proposed method can be implemented in various ways; in this case we use the XQuery language.

Eric Pardede, J. Wenny Rahayu, David Taniar
On the Midpoint of a Set of XML Documents

The WWW contains a huge number of documents. Some of them share a subject, but are generated by different people or even organizations. To guarantee the interchange of such documents, we can use XML, which allows sharing documents that do not have the same structure. However, this makes it difficult to understand the core of such heterogeneous documents (in general, a schema is not available). In this paper, we offer a characterization and an algorithm to obtain the midpoint (in terms of a resemblance function) of a set of semi-structured, heterogeneous documents without optional elements. The trivial case of the midpoint would be the elements common to all documents. Nevertheless, with several heterogeneous documents this may result in an empty set. Thus, we consider that those elements present in a given number of documents belong to the midpoint. An exact schema could always be found by generating optional elements. However, the exact schema of the whole set may result in overspecialization (lots of optional elements), which would make it useless.

Alberto Abelló, Xavier de Palol, Mohand-Saïd Hacid
Full-Text and Structural XML Indexing on B+-Tree

XML query processing is one of the most active areas of database research. Although the main focus of past research has been the processing of structural XML queries, there are growing demands for full-text search over XML documents. In this paper, we propose XICS (XML Indices for Content and Structural search), novel indices built on a B+-tree, for the fast processing of queries that involve structural and full-text searches of XML documents. To represent the structural information of XML trees, each node in the XML tree is labeled with an identifier. The identifier contains an integer number representing the path information from the root node. XICS consist of two types of indices, the COB-tree (COntent B+-tree) and the STB-tree (STructure B+-tree). The search keys of the COB-tree are a pair of text fragments in the XML document and the identifiers of the leaf nodes that contain the text, whereas the search keys of the STB-tree are the node identifiers. By using a node identifier in the search keys, we can retrieve only the entries that match the path information in the query. Our experimental results show the efficiency of XICS in query processing.

Toshiyuki Shimizu, Masatoshi Yoshikawa
XML-Based e-Barter System for Circular Supply Exchange

A key function for any barter service is to detect circular exchanges in which all demands and supplies of a circle of users are satisfied. We call the demand and supply of a user his queries and data, respectively. The problem of finding a circular exchange is to detect directed cycles in an exchange graph where an edge connects one user’s supply to another user’s supply that satisfies the first user’s demand. Our contributions to solving this problem are twofold: 1) a process model of constructing an exchange graph, and 2) two cycle detection algorithms that can find all possible directed cycles. Our model processes an incoming user’s queries and data across the stored users’ data and queries, respectively, by combining database query processing and stream data processing. The algorithms are extensions of depth-first search (DFS) and Strongly-Connected-Component search (SCCS). Experiments show that our enhanced version of SCCS outperforms the enhanced version of DFS by factors ranging from 23 to 132.
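To make the exchange-graph formulation concrete, the sketch below builds such a graph with NetworkX and enumerates directed cycles with its stock simple_cycles routine; the paper's incremental construction and its extended DFS/SCCS algorithms are not shown, and the offer data are invented.

```python
# Exchange graph: edge u -> v means v's supply satisfies u's demand.
import networkx as nx

offers = {                        # user: (supply, demand) -- toy data
    "alice": ("book", "lamp"),
    "bob":   ("lamp", "chair"),
    "carol": ("chair", "book"),
}
g = nx.DiGraph()
for u, (_, demand) in offers.items():
    for v, (supply, _) in offers.items():
        if u != v and supply == demand:
            g.add_edge(u, v)      # v can satisfy u's demand

print(list(nx.simple_cycles(g)))  # each cycle is a feasible circular exchange
```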

Shuichi Nishioka, Yuri Yaguchi, Takahiro Hamada, Makoto Onizuka, Masashi Yamamuro
Context-Sensitive Complementary Information Retrieval for Text Stream

With constant advances in information technology, more and more information is available and users’ information needs are becoming more diverse. Most conventional information systems only attempt to provide information that meets users’ specific interests. In contrast, we are working on ways of discovering information from the viewpoints of both interest and necessity. For example, we are trying to discover complementary information that provides additional knowledge on the users’ topics of interest, not just information that is similar to the topic. In previous work, which was based on extracting topic structures from closed-caption data, we proposed methods of searching for information to complement TV program content; that is, to provide users with more detailed information or different viewpoints. In this paper, we focus on the features of text streams (closed-caption data, etc.) and propose a method for context-sensitive retrieval of complementary information. We modified our topic-structure model for content representation and consider the “context” of a text stream in searching for complementary information. The “context” of the text stream is considered to be a series of topic structures. Based on this kind of context, we propose methods of searching for complementary information for TV programs, including query-type selection, query modification, and computation of the degree of complementarity. The experimental results showed that, compared to our previous methods, the context-sensitive method could provide more additional information and avoid information overlap.

Qiang Ma, Katsumi Tanaka
Detecting Changes to Hybrid XML Documents Using Relational Databases

Recent works in XML change detection have focused on detecting changes to ordered or unordered XML documents. However, in real life XML documents may not always be purely ordered or purely unordered. It is indeed possible to have both ordered and unordered nodes in the same XML document (such documents are called hybrid XML). In this paper, we present a technique for detecting the changes to hybrid XML documents. In our approach, old and new versions of XML documents are first stored in a relational database. Then, the order learning module is used to determine the node types in hybrid XML. The change detection module then uses the knowledge of node types to detect the changes by issuing SQL queries. Our experimental results show that our approach produces better result quality compared to existing approaches.

Erwin Leonardi, Sri L. Budiman, Sourav S. Bhowmick
An Index-Based Method for Timestamped Event Sequence Matching

This paper addresses the problem of timestamped event sequence matching, a new type of sequence matching that retrieves the occurrences of interesting patterns from a timestamped event sequence. Timestamped event sequence matching is useful for discovering temporal causal relationships among timestamped events. In this paper, we first point out the shortcomings of prior approaches to this problem and then propose a novel method that employs an R*-tree to overcome them. To build an R*-tree, it places a time window at every position of a timestamped event sequence and represents each window as an n-dimensional rectangle by considering the first and last occurrence times of each event type. Here, n is the total number of disparate event types that may occur in a target application. When n is large, we apply a grouping technique to reduce the dimensionality of an R*-tree. To retrieve the occurrences of a query pattern from a timestamped event sequence, the proposed method first identifies a small number of candidates by searching an R*-tree and then picks out true answers from them. We prove its robustness formally, and also show its effectiveness via extensive experiments.
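The window-to-rectangle mapping the method relies on can be sketched in a few lines: for each event type occurring inside a time window, record the first and last occurrence times, which become that type's extent in the n-dimensional rectangle inserted into the R*-tree. Index construction, grouping and search are omitted; the names and data are illustrative only.

```python
# Map one time window of a timestamped event sequence to per-type intervals.
def window_rectangle(events, start, width, event_types):
    """events: list of (time, type). Returns {type: (first, last)} inside the window."""
    inside = [(t, e) for t, e in events if start <= t < start + width]
    rect = {}
    for etype in event_types:
        times = [t for t, e in inside if e == etype]
        if times:
            rect[etype] = (min(times), max(times))
    return rect

events = [(1, "A"), (2, "B"), (4, "A"), (7, "C")]
print(window_rectangle(events, start=0, width=5, event_types=["A", "B", "C"]))
# -> {'A': (1, 4), 'B': (2, 2)}
```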

Sanghyun Park, Jung-Im Won, Jee-Hee Yoon, Sang-Wook Kim
Time Parameterized Interval R-Tree for Tracing Tags in RFID Systems

For tracing tag locations, the trajectories should be modeled and indexed in a radio frequency identification (RFID) system. The trajectory of a tag is represented as a line that connects two spatiotemporal locations captured when the tag enters and leaves the vicinity of a reader. If a tag enters but does not leave a reader, its trajectory is represented only as a point captured at entry. Because the information that a tag stays in a reader is missing from the trajectory represented only as a point, it is impossible to find the tag that remains in a reader. To solve this problem we propose a data model in which trajectories are defined as time-parameterized intervals and a new index scheme called the Time Parameterized Interval R-tree. We also propose new insert and split algorithms to enable efficient query processing. We evaluate the performance of the proposed index scheme and compare it with the R-tree and the R*-tree. Our experiments show that the new index scheme outperforms the other two in processing queries of tags on various datasets.

ChaeHoon Ban, BongHee Hong, DongHyun Kim
Efficient Algorithms for Constructing Time Decompositions of Time Stamped Documents

Identifying temporal information of topics from a document set typically involves constructing a time decomposition of the time period associated with the document set. In an earlier work, we formulated several metrics on a time decomposition, such as size, information loss, and variability, and gave dynamic programming based algorithms to construct time decompositions that are optimal with respect to these metrics. Computing information loss values for all subintervals of the time period is central to the computation of optimal time decompositions. This paper proposes several algorithms to assist in more efficiently constructing an optimal time decomposition. More efficient, parallelizable algorithms for computing loss values are described. An efficient top-down greedy heuristic to construct an optimal time decomposition is also presented. Experiments to study the performance of this greedy heuristic were conducted. Although lossy time decompositions constructed by the greedy heuristic are suboptimal, they seem to be better than the widely used uniform length decompositions.

Parvathi Chundi, Rui Zhang, Daniel J. Rosenkrantz
Querying by Sketch Geographical Databases and Ambiguities

This paper presents GSQL (Geographical Sketch Query Language), a sketch-based approach for querying geographical databases based on the GeoPQL query language. Geographic information is intrinsically spatial and can be conveniently represented using a bi-dimensional space. Graphical User Interfaces and Visual Languages can be used to satisfy this need. However, the growing availability of sketching tools (PDA, digital pens, etc.) enables a more informal and natural user interaction. Sketch-based interaction is very effective. Each user can easily sketch his/her geographical query by drawing it, erasing and modifying its parts and highlighting the query target. A query is the expression of the configuration of the expected result. Sketch recognition and query interpretation (and solution of their ambiguities) starts from a context-independent approach and uses the characteristic application domain information. Context-independent sketch interpretation uses spatial and temporal information related to the sketching process. Context-dependent sketch interpretation uses geographic domain information to solve the remaining ambiguities and correctly interpret the drawing and query. An analysis of the ambiguities characterising object sketching in the geographic application domain and their possible solutions are presented herein. Each query identifies the set of geographical objects involved and the target; the query interpretation must unambiguously identify the set of its results.

Fernando Ferri, Patrizia Grifoni, Maurizio Rafanelli
Foundations for Automated Trading — It's the Information That Matters

Business relationships, and the sense of trust that they embody, provide an environment in which trade may be conducted with confidence. Trading involves the maintenance of effective relationships and refers to the process of need identification, partner selection, offer exchange, contract negotiation, and contract execution; we therefore use the term in a sense that includes the business of e-procurement. Of particular interest are the selection of trading partners, the development of trading relationships, and the negotiation and execution of contracts in the context of a relationship.

John Debenham
Intensional Query Answering to XQuery Expressions

XML is a representation of data which may require huge amounts of storage space and query processing time. Summarized representations of XML data provide succinct information that can be queried directly, either when fast yet approximate answers are sufficient or when the actual dataset is not available. In this work we show which kinds of XQuery expressions admit a partial answer by using association rules extracted from XML datasets. Such partial information provides intensional answers to queries formulated as XQuery expressions.

Simone Gasparini, Elisa Quintarelli
Optimizing Sorting and Duplicate Elimination in XQuery Path Expressions

XQuery expressions can manipulate two kinds of order: document order and sequence order. While the user can impose or observe the order of items within a sequence, the results of path expressions must always be returned in document order. Correctness can be obtained by inserting explicit (and expensive) operations to sort and remove duplicates after each XPath step. However, many such operations are redundant. In this paper, we present a systematic approach to remove unnecessary sorting and duplicate elimination operations in path expressions in XQuery 1.0. The technique uses an automaton-based algorithm which we have applied successfully to path expressions within a complete XQuery implementation. Experimental results show that the algorithm detects and eliminates most redundant sorting and duplicate elimination operators and is very effective on common XQuery path expressions.

Mary Fernández, Jan Hidders, Philippe Michiels, Jérôme Siméon, Roel Vercammen
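
The baseline behaviour the paper optimizes away, sorting and removing duplicates after every XPath step, can be pictured with a short generic Python sketch (assuming each node carries a precomputed document-order key; this is not the authors' automaton-based detection algorithm):

def distinct_doc_order(nodes, doc_order_key):
    """Naive per-step normalization: sort by document order, drop duplicates.

    doc_order_key maps a node to its position in document order
    (e.g. a preorder number assigned while parsing).
    """
    seen = set()
    out = []
    for node in sorted(nodes, key=doc_order_key):
        k = doc_order_key(node)
        if k not in seen:
            seen.add(k)
            out.append(node)
    return out

# After an XPath step, intermediate results may contain duplicates and be out
# of order; the naive baseline re-normalizes after every single step.
step_result = ["n5", "n2", "n5", "n9"]
print(distinct_doc_order(step_result, doc_order_key=lambda n: int(n[1:])))
# ['n2', 'n5', 'n9']
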
SIOUX: An Efficient Index for Processing Structural XQueries

XML DBMSs require new indexing techniques to efficiently process structural search and full-text search as integrated in XQuery. Much research has been done on indexing XML documents. In this paper we first survey some of this work and suggest a classification scheme. It appears that most techniques index paths in XML documents and maintain a separate index on values. In some cases, the two indexes are merged and/or tags are encoded. We propose a new method that indexes XML documents on ordered trees, i.e., two documents are in the same equivalence class if they have the same tree structure, with identical elements in order. We develop a simple benchmark to compare our method with two well-known European products. The results show that indexing on full trees leads to smaller index size and achieves 1 to 10 times better query performance in comparison with classical industrial methods that are path-based.

Georges Gardarin, Laurent Yeh
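
The idea of indexing documents by their full ordered tree structure can be illustrated by grouping documents under a canonical structural signature; the following is a hypothetical sketch using Python's standard xml.etree module, not the SIOUX index itself:

import xml.etree.ElementTree as ET
from collections import defaultdict

def tree_signature(elem):
    """Canonical signature of an ordered element tree: tag plus ordered children."""
    return (elem.tag, tuple(tree_signature(child) for child in elem))

def group_by_structure(xml_docs):
    """Map each structural equivalence class to the documents that share it."""
    classes = defaultdict(list)
    for doc_id, text in xml_docs.items():
        root = ET.fromstring(text)
        classes[tree_signature(root)].append(doc_id)
    return classes

docs = {
    "d1": "<book><title/><author/></book>",
    "d2": "<book><title/><author/></book>",   # same ordered structure as d1
    "d3": "<book><author/><title/></book>",   # different order, different class
}
for sig, ids in group_by_structure(docs).items():
    print(ids)   # ['d1', 'd2'] then ['d3']
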
Searching Multi-hierarchical XML Documents: The Case of Fragmentation

To properly encode properties of textual documents using XML, multiple markup hierarchies must be used, often leading to conflicting markup in encodings. Text Encoding Initiative (TEI) Guidelines [1] recognize this problem and suggest a number of ways to incorporate multiple hierarchies in a single well-formed XML document. In this paper, we present a framework for processing XPath queries over multi-hierarchical XML documents represented using fragmentation, one of the TEI-suggested techniques. We define the semantics of XPath over DOM trees of fragmented XML, extend the path expression language to cover overlap in markup, and describe FragXPath, our implementation of the proposed XPath semantics over fragmented markup.

Alex Dekhtyar, Ionut E. Iacob, Srikanth Methuku
Semantic Storage: A Report on Performance and Flexibility

Desktop search tools are becoming more popular, and they have to deal with increasing amounts of locally stored data. Another approach is to analyze the semantic relationships between collected data in order to preprocess the data semantically. The goal is to allow searches based on relationships between objects instead of focusing on object names. We introduce a database architecture, based on an existing software prototype, which is capable of meeting the various demands of a semantic information manager. We describe the use of an association table which stores the relationships between events. It makes it easy to add or remove data items without schema modifications, and existing RDBMS optimization techniques can still be used.

Edgar R. Weippl, Markus Klemen, Manfred Linnert, Stefan Fenz, Gernot Goluch, A. Min Tjoa
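
A minimal SQLite sketch of the association-table idea is given below; table and column names are illustrative assumptions, not the prototype's actual schema. New relationship types are added as rows, so no schema modification is needed:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT);
    -- One generic association table instead of one table per relationship type:
    CREATE TABLE association (
        source INTEGER REFERENCES item(id),
        relation TEXT,
        target INTEGER REFERENCES item(id)
    );
""")
conn.executemany("INSERT INTO item VALUES (?, ?)",
                 [(1, "report.pdf"), (2, "meeting 2005-06-01"), (3, "Alice")])
# New relationship types need no schema change, only new rows:
conn.executemany("INSERT INTO association VALUES (?, ?, ?)",
                 [(1, "attached_to", 2), (3, "participated_in", 2)])

# Relationship-based search: everything associated with the meeting (item 2).
rows = conn.execute("""
    SELECT i.name, a.relation FROM association a
    JOIN item i ON i.id = a.source
    WHERE a.target = 2
""").fetchall()
print(rows)
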
Towards Truly Extensible Database Systems

In modern universal database management systems (DBMSs), user-defined data types along with extensible indexing structures try to bridge the gap between standard data-independent DBMS implementations and the requirement of specialized access methods for efficient domain-specific data retrieval, maintenance, and storage. However, these approaches often restrict the degrees of freedom in the implementation and limit the availability of crucial database features. Due to their design concepts, these extensible indexing frameworks are not well suited for the rapid development and evaluation of research prototypes, as they lack essential generalization, completeness, and depth of integration into the host DBMS. We discuss the advantages and drawbacks of available extensible indexing techniques and present several methods that can be easily combined into a powerful and flexible framework for storing, indexing, and manipulating domain-specific data from any data source. We demonstrate that this framework comprises all properties truly extensible indexing should have. A prototype implementation of this framework was integrated into the relational DBMS Transbase®.

Ralph Acker, Roland Pieringer, Rudolf Bayer
Transaction Management with Integrity Checking

Database integrity constraints, understood as logical conditions that must hold for any database state, are not fully supported by current database technology. It is typically up to the database designer and application programmer to enforce integrity via triggers or tests at the application level, which are difficult to maintain and error prone. Two important aspects must be taken care of. First, it is too time-consuming to check integrity constraints from scratch after each update, so simplified checks before each update should be used, relying on the assumption that the current state is consistent. Second, in concurrent database systems, besides the traditional correctness criterion, the execution schedule must ensure that different transactions can overlap in time without destroying the consistency requirements tested by other, concurrent transactions. We show in this paper how to apply a method for incremental integrity checking to automatically extend update transactions with locks and simplified consistency tests on the locked elements. All schedules produced in this way are conflict serializable and preserve consistency in an optimized way.

Davide Martinenghi, Henning Christiansen
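
The first aspect, replacing a from-scratch constraint check by a simplified test of the update itself, can be illustrated on a key constraint; this is a minimal sketch assuming the current state is consistent, and it does not show the paper's simplification or locking machinery:

def full_check(table, key):
    """From-scratch check: no two rows share the key attribute."""
    keys = [row[key] for row in table]
    return len(keys) == len(set(keys))

def simplified_check(table, new_row, key):
    """Incremental check: assuming the current table is already consistent,
    only the inserted row can introduce a key violation, so only it is tested."""
    return all(row[key] != new_row[key] for row in table)

state = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
candidate = {"id": 2, "name": "c"}

# The incremental test only inspects the new row against the existing state
# instead of re-validating the whole table after the update.
if simplified_check(state, candidate, key="id"):
    state.append(candidate)
else:
    print("update rejected: would violate key constraint")
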
An Optimal Skew-insensitive Join and Multi-join Algorithm for Distributed Architectures

The development of scalable parallel database systems requires the design of efficient algorithms for the join operation, which is the most frequent and expensive operation in relational database systems. The join is also the operation most vulnerable to data skew and to the high cost of communication in distributed architectures.

In this paper, we present a new parallel algorithm for join and multi-join operations on distributed architectures based on an efficient semi-join computation technique. This algorithm is proved to have optimal complexity and deterministic perfect load balancing. Its tradeoff between balancing overhead and speedup is analyzed using the BSP cost model, which predicts a negligible join product skew and a linear speed-up. This algorithm improves our fa_join and sfa_join algorithms by reducing their communication and synchronization cost to a minimum while offering the same load balancing properties even for highly skewed data.

Mostafa Bamha
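
As a reminder of the building block involved, a semi-join keeps only the rows of one relation whose join key appears in the other, which is what allows communication to be reduced before the actual join; the Python sketch below is a generic illustration, not the fa_join/sfa_join redistribution logic:

def semi_join(r_rows, s_rows, key):
    """R semi-join S: keep only the R rows whose key also appears in S.

    In a distributed join, shipping this reduced R (or just the key set of S)
    instead of whole relations cuts communication cost.
    """
    s_keys = {row[key] for row in s_rows}
    return [row for row in r_rows if row[key] in s_keys]

R = [{"cust": 1, "name": "A"}, {"cust": 2, "name": "B"}, {"cust": 3, "name": "C"}]
S = [{"cust": 2, "total": 10}, {"cust": 3, "total": 7}]
print(semi_join(R, S, key="cust"))   # only customers 2 and 3 survive
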
Evaluation and NLP

The F-measure is an indicator, based on precision and recall, that has been commonly used for 25 years to evaluate classification algorithms in text mining. For classification and information retrieval, some prefer to use the break-even point. Nevertheless, these measures have some drawbacks: they use a binary logic and do not allow a user (judge) assessment to be applied. This paper proposes a new approach to evaluation. First, we distinguish classification and categorization from a semantic point of view. Then, we introduce a new measure, the K-measure, which generalizes the F-measure and allows user requirements to be applied. Finally, we propose a methodology for evaluation.

Didier Nakache, Elisabeth Metais, Jean François Timsit
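
For reference, the precision- and recall-based F-measure that the proposed K-measure generalizes can be computed as follows (standard textbook definition, not the paper's K-measure):

def f_measure(tp, fp, fn, beta=1.0):
    """F-measure from true positives, false positives and false negatives.

    beta > 1 weights recall more heavily, beta < 1 weights precision.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(f_measure(tp=40, fp=10, fn=20), 3))   # 0.727 for P=0.8, R=0.667
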
Movies Recommenders Systems: Automation of the Information and Evaluation Phases in a Multi-criteria Decision-Making Process

The authors focus on the advanced recommendation functionalities offered by a growing number of Internet websites for the selection of movies, e-business sites, or any e-purchases. These functionalities often rely on Internet users' opinions and evaluations. A "movie recommender" application is presented. Recommender websites generally aggregate users' evaluations or critiques according to criteria relevant to the application. The authors propose an Information Processing System (IPS) to collect, process and manage these opinions or critiques as automatically as possible, in order to support this multi-criteria evaluation for recommendation. The Recommender System (RS) first applies information extraction techniques to classify the available users' critiques w.r.t. the criteria involved in the evaluation process and to automatically associate numerical scores with these critiques. Multi-criteria techniques are then introduced to numerically evaluate, compare and rank the competing movies to which the critiques refer. Finally, the RS is presented as an interactive Decision-Making Support System (DMSS) relying on a sensitivity analysis of the movie ranking. Particular attention is paid to the automation of the information phase in the decision-making process: mapping movie comments according to users' evaluation criteria and assigning a partial score to each critique, considered as the expression of a value judgment in natural language.

Michel Plantié, Jacky Montmain, Gérard Dray
On Building a DyQE – A Medical Information System for Exploring Imprecise Queries

This paper sets out DyQE, a dynamic query expansion information retrieval system implemented in the medical domain and aimed at exploring imprecise queries. DyQE enhances standard query and result-set navigation through integration with dynamic taxonomy and broad query expansion. The query expansion rules are sourced from a fine-grained evaluation of data retrieval-based query expansion. DyQE offers both a concise representation of a broad subject area and a wide variety of interesting links to related subject areas.

Dennis Wollersheim, Wenny J. Rahayu
A Proposal for a Unified Process for Ontology Building: UPON

Ontologies are the backbone of the Semantic Web, a semantics-aware version of the World Wide Web. To make large-scale, high-quality domain ontologies available, effective and usable methodologies are needed to facilitate the process of ontology building. Many of the methods proposed so far refer only partly to well-known and widely used standards from other areas, such as software engineering and knowledge representation. In this paper we present UPON, a methodology for ontology building derived from the Unified Software Development Process. A comparative evaluation with other methodologies, as well as the results of its adoption in the context of the Athena Integrated Project, are also discussed.

Antonio De Nicola, Michele Missikoff, Roberto Navigli
Transforming Software Package Classification Hierarchies into Goal-Based Taxonomies

Software package selection is an activity that plays an increasingly crucial role in the delivery of software systems. One of its main open issues is how to structure knowledge about the software marketplace, in particular how to know which types of packages are available and what their objectives are. Profit and non-profit organizations of all kinds typically arrange these types into categories in a hierarchical form. However, the rationale behind the proposals found is often confusing, which hampers their usefulness. In this paper we propose the use of taxonomies for structuring this knowledge. Our taxonomies are goal-driven, which means that we provide a rationale for the decisions taken. The leaves of the taxonomies are the types of packages available in the market, whilst the intermediate nodes are categories that group them when closer relationships are found. The proposed taxonomies are not defined from scratch but by applying appropriate transformation rules to some existing classification. We define the syntactic form of the rules as well as their applicability conditions as properties on the involved goals. We apply them to a particular case, a taxonomy for business applications.

Claudia Ayala, Xavier Franch
Approximations of Concept Based on Multielement Bounds

Ontology-based information systems can increase the precision and recall of information retrieval and filtering on the Web, but they face the problem of ontology heterogeneity. The approximate information filtering approach rewrites queries with their approximations to suit specific ontology-based information sources. The core of the approach is to find approximations of concepts. Current methods find least upper bounds and greatest lower bounds of a concept and then use them to obtain upper and lower approximations of the concept. However, they consider bounds containing only individual concepts, so the quality of the approximations is not very satisfactory. This paper defines multielement least upper bounds and multielement greatest lower bounds that contain not only individual concepts but also disjunctions or conjunctions of concepts. It can be proved that the approximations based on them are better approximations of concepts than those produced by current methods.

Jianjiang Lu, Baowen Xu, Dazhou Kang, Yanhui Li, Peng Wang
Query Expansion Using Web Access Log Files

Query expansion has long been recognized as one of the effective methods for handling short queries and improving ranking accuracy in traditional IR research. Many variations of this method have been introduced over the past decades; however, few of them have incorporated web log information into the query expansion process. In this paper, we propose an expansion technique that expands document content at the initial indexing stage using queries extracted from web log files. Our experimental results show that even with a minimal amount of real-world log information available and a professionally catalogued knowledge structure to aid the search, our query expansion method still yields a significant improvement over conventional query expansion techniques.

Yun Zhu, Le Gruenwald
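
The indexing-time expansion described above can be pictured as appending the terms of logged queries to the documents those queries led to; the following Python sketch is hypothetical (the click association and field names are assumptions, not the paper's exact procedure):

from collections import defaultdict

def expand_documents(documents, log_entries):
    """Append the terms of logged queries to the documents users reached.

    documents: {doc_id: text}
    log_entries: iterable of (query_string, reached_doc_id) pairs from a web log.
    Returns expanded texts to be fed to the indexer.
    """
    extra_terms = defaultdict(list)
    for query, doc_id in log_entries:
        extra_terms[doc_id].extend(query.lower().split())
    return {doc_id: text + " " + " ".join(extra_terms.get(doc_id, []))
            for doc_id, text in documents.items()}

docs = {"d1": "influenza vaccination guidelines"}
log = [("flu shot side effects", "d1"), ("influenza vaccine", "d1")]
print(expand_documents(docs, log)["d1"])
# the users' query vocabulary ("flu shot") is now searchable in d1
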
An XML Approach to Semantically Extract Data from HTML Tables

Data-intensive information is often published on the Internet in the form of HTML tables. Extracting the information of interest to users from the Internet, especially when a large number of web pages needs to be accessed, is time consuming. To automate the process of information extraction, this paper proposes an XML-based way of semantically analyzing HTML tables for the data of interest. It first introduces a mini language in XML syntax for specifying ontologies that represent the data of interest. It then defines algorithms that parse HTML tables into a specially defined type of XML tree. The XML trees are then compared with the ontologies to semantically analyze and locate the parts of a table, or nested tables, that contain the interesting data. Finally, the interesting data, once identified, is output as XML documents.

Jixue Liu, Zhuoyun Ao, Ho-Hyun Park, Yongfeng Chen
Automatic Generation of Semantic Fields for Resource Discovery in the Semantic Web

In this paper we present and evaluate two approaches for the generation of Semantic Fields, which are used as a tool for resource discovery in the Semantic Web. We mainly concern ourselves with semantic networks that describe their interests and resources by means of ontologies. Semantic Fields are intended to help users to locate these resources by specifying a brief description (also as an ontology). We propose two ways to automatically build Semantic Fields. The first one is used in the KREIOS approach, which is based on the pre-computation of distances between all the ontology pairs. The second one is based on a fast incremental clustering algorithm, which groups together similar ontologies as they are published. These groups constitute a pre-computed set of Semantic Fields.

I. Navas, I. Sanz, J. F. Aldana, R. Berlanga
JeromeDL – Adding Semantic Web Technologies to Digital Libraries

In recent years more and more information has been made available on the Web. High-quality information is often stored in the dedicated databases of digital libraries, which are on their way to becoming expanding islands of well-organized information. However, managing this information still poses challenges. The Semantic Web provides technologies that can help to meet these challenges.

In this article we present JeromeDL, a full-fledged open-source digital library system. We exemplify how digital library content management can benefit from the Semantic Web. We define and evaluate browsing and searching features. We describe how semantic descriptions of resources and user profiles improve the usability of a digital library. We present how digital libraries can be interconnected into one heterogeneous database with the use of semantic technologies.

Sebastian Ryszard Kruk, Stefan Decker, Lech Zieborak
Analysis and Visualization of the DX Community with Information Extracted from the Web

The aim of the present work is the graphical representation of the structures of a specific knowledge area with data extracted from the Web. Recent Internet development has facilitated access to these new resources and has contributed towards the creation of new disciplines which are aimed at taking full advantage of these resources. The main obstacles to this exploitation are their location and processing. This paper defines a generic architecture which solves the problems of processing these resources by combining techniques of extraction, analysis and visualization of data. Specifically in this work we will automatically explore two of the most important structures which define the DX community: the subjects within the community which generate the greatest interest and its social network. Graphical representations are presented to facilitate the analysis of this community.

F. T. de la Rosa, M. T. Gómez-López, R. M. Gasca
Learning Robust Web Wrappers

A main challenge in wrapping web data is to make wrappers robust w.r.t. variations in HTML sources while reducing human effort as much as possible. In this paper we develop a new approach to speed up the specification of robust wrappers, freeing the wrapper designer from the detailed definition of extraction rules. The key idea is to enable a schema-based wrapping system to automatically generalize an original wrapper w.r.t. a set of example HTML documents. To accomplish this objective, we propose to exploit the notions of extraction rule and wrapper subsumption for computing a most general wrapper which still shares the extraction schema with the original wrapper, while maximizing the generalization of extraction rules w.r.t. the set of example documents.

B. Fazzinga, S. Flesca, A. Tagarelli
Control-Based Quality Adaptation in Data Stream Management Systems

Unlike the processing of snapshot queries in a traditional DBMS, the processing of continuous queries in a data stream management system (DSMS) needs to satisfy quality requirements such as processing delay. When the system is overloaded, quality degrades significantly and load shedding becomes necessary. Maintaining the quality of queries is a difficult problem because both the processing cost and the data arrival rate are highly unpredictable. We propose a quality adaptation framework that adjusts the application behavior based on the current system status, leveraging techniques from control theory in its design. Our simulation results demonstrate the effectiveness of the control-based quality adaptation strategy. Compared to solutions proposed in previous works, our approach achieves significantly better quality with less waste of resources.

Yi-Cheng Tu, Mohamed Hefeeda, Yuni Xia, Sunil Prabhakar, Song Liu
Event Composition and Detection in Data Stream Management Systems

There is a rising need to handle and process streaming data, which is continuous, unpredictable and time-varying in nature, and may arrive in multiple rapid streams. Sensor data and web clickstreams are examples of streaming data. An important issue for streaming data management systems is that the data needs to be processed in real time; to this end, active rules can be defined over data streams to make the system reactive. These rules are triggered based on events detected on the data stream, events detected while summarizing the data, or a combination of both. In this paper, we study the challenges involved in monitoring events in a Data Stream Management System (DSMS) and how they differ from those in active databases. We propose an architecture for event composition and detection in a DSMS, and then discuss an algorithm for detecting composite events defined on both summarized data streams and the streaming data itself.

Mukesh Mohania, Dhruv Swamini, Shyam Kumar Gupta, Sourav Bhowmick, Tharam Dillon
Automatic Parsing of Sports Videos with Grammars

Motivated by the analogies between languages and sports videos, we introduce a novel approach for video parsing with grammars. It utilizes compiler techniques for integrating both semantic annotation and syntactic analysis to generate a semantic index of events and a table of content for a given sports video. The video sequence is firstly segmented and annotated by semantic event detection with domain knowledge. A grammar-based parser is then used to identify the structure of the video content. Meanwhile, facilities for error handling are introduced which are particularly useful when the results of automatic parsing need to be adjusted. As a case study, we have developed a system for video parsing in the particular domain of TV diving programs. Experimental results indicate the proposed approach is effective.

Fei Wang, Kevin J. Lü, Jing-Tao Li, Jianping Fan
Improved Sequential Pattern Mining Using an Extended Bitmap Representation

The main challenge in mining sequential patterns is the high processing cost of support counting for a large number of candidate patterns. To address this problem, the SPAM algorithm was proposed at SIGKDD 2002; it combines a depth-first traversal of the search space with a vertical bitmap representation to provide efficient support counting. According to its experimental results, SPAM outperformed the earlier SPADE and PrefixSpan algorithms on large datasets. However, SPAM is efficient only under the assumption that a huge amount of main memory is available, so its practicality is in question. In this paper, an improved version of the SPAM algorithm, called I-SPAM, is proposed. By extending the data representation structures, several heuristic mechanisms are proposed to further speed up support counting. Moreover, the memory required for storing temporary data during the mining process is less than that needed by SPAM. The experimental results show that I-SPAM achieves execution times of the same order as SPAM, and often better, while requiring about half of SPAM's maximum memory.

Chien-Liang Wu, Jia-Ling Koh, Pao-Ying An
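
The vertical bitmap representation underlying SPAM-style support counting can be sketched as follows: each item keeps one bit per transaction, and the support of a candidate is the popcount of the AND of its bitmaps. This is an itemset-level Python illustration that omits the sequence-extension bit shifting and the I-SPAM heuristics:

def build_bitmaps(transactions):
    """One integer bitmap per item; bit i is set if transaction i contains the item."""
    bitmaps = {}
    for i, items in enumerate(transactions):
        for item in items:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << i)
    return bitmaps

def support(bitmaps, candidate):
    """Support of a candidate itemset = popcount of the AND of its item bitmaps."""
    combined = ~0
    for item in candidate:
        combined &= bitmaps.get(item, 0)
    return bin(combined & ((1 << 64) - 1)).count("1")  # mask to a fixed width

transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
bm = build_bitmaps(transactions)
print(support(bm, {"a", "b"}))   # 2: transactions 0 and 2 contain both items
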
Dimension Transform Based Efficient Event Filtering for Symmetric Publish/Subscribe System

There exists a class of publish/subscribe applications, such as recruitment, insurance, personal services, classified advertisements and electronic commerce, where the publisher needs the capability to select subscribers. Such publish/subscribe applications are called symmetric publish/subscribe systems. The existing event matching algorithms designed for traditional (asymmetric) publish/subscribe systems cannot be applied to symmetric publish/subscribe systems efficiently.

By extending an existing data model and algorithm, we propose an event matching method for symmetric publish/subscribe systems based on a dimension transform of the query in multidimensional space. An efficient underlying multidimensional index structure is chosen and verified. Our proposal is evaluated in a simulated environment. The results show that our proposal outperforms the other possible solutions by one to two orders of magnitude. For a typical workload containing one million subscriptions with 16 attributes, an event can be filtered within several milliseconds and the subscription base can be updated within hundreds of microseconds. We conclude that our proposal is efficient and practical for symmetric publish/subscribe systems.

Botao Wang, Masaru Kitsuregawa
Scalable Distributed Aggregate Computations Through Collaboration

Computing aggregates over distributed data sets constitutes an interesting class of distributed queries. Recent advances in peer-to-peer discovery of data sources and query processing techniques have made such queries feasible and potentially more frequent. The concurrent execution of multiple and often identical distributed aggregate queries can place a high burden on the data sources. This paper identifies the scalability bottlenecks that can arise in large peer-to-peer networks from the execution of large numbers of aggregate computations and proposes a solution. In our approach peers are assigned the role of aggregate computation maintainers, which leads to a substantial decrease in requests to the data sources and also avoids duplicate computation by the sites that submit identical aggregate queries. Moreover, a framework is presented that facilitates the collaboration of peers in maintaining aggregate query results. Experimental evaluation of our design demonstrates that it achieves very good performance and scales to thousands of peers.

Leonidas Galanis, David J. DeWitt
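
The saving comes from the fact that distributive aggregates can be maintained from partial results instead of being recomputed at every data source for every identical query; a generic Python sketch of that combination step (not the paper's peer-assignment or collaboration protocol) follows:

def local_partial(values):
    """Each data source reports only (sum, count) for its local values."""
    return (sum(values), len(values))

def merge_partials(partials):
    """A maintainer peer merges partial states and answers SUM/COUNT/AVG queries
    without contacting the data sources again."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return {"SUM": total, "COUNT": count, "AVG": total / count if count else None}

site_a = local_partial([3, 5, 7])
site_b = local_partial([10, 20])
print(merge_partials([site_a, site_b]))   # {'SUM': 45, 'COUNT': 5, 'AVG': 9.0}
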
Schemas and Queries over P2P

Query processing in overlay networks has recently been receiving significant attention from researchers. Such systems offer flexibility and decentralization. The challenge is to add representation and query capacity in such a networked environment, resulting in a peer-to-peer data management system. The issue of how schemas can be incorporated and queried in P2P systems while abiding by the decentralized and flexible philosophy has not been properly handled. In this paper we analyze constructs and strategies to build database-like schemas and process simple queries over those schemas on P2P. To this end we describe schema spaces, attribute value locators and data shipment issues. We also carry out experimental work to analyze our proposals, using a network simulator to test the strategies under different network scenarios.

Pedro Furtado
Threshold Based Declustering in High Dimensions

Declustering techniques reduce query response times through parallel I/O by distributing data among multiple devices. Except for a few cases, it is not possible to find declustering schemes that are optimal for all spatial range queries. As a result, most of the research on declustering has focused on finding schemes with low worst-case additive error. Recently, constrained declustering was proposed, which maximizes the threshold k such that all spatial range queries of at most k buckets are optimal. In this paper, we extend constrained declustering to high dimensions. We investigate high-dimensional bound diagrams that are used to provide an upper bound on the threshold and propose a method to find good threshold-based declustering schemes in high dimensions. We show that using replicated declustering with threshold N, low worst-case additive error can be achieved for many values of N. In addition, we propose a framework to find thresholds in replicated declustering.

Ali Şaman Tosun
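
For context, the classic Disk Modulo scheme below shows what a declustering function looks like: each grid bucket is mapped to one of M devices so that a range query is spread over as many devices as possible. This Python sketch is the textbook baseline, not the threshold-based replicated scheme proposed in the paper:

def disk_modulo(coords, num_devices):
    """Disk Modulo (DM) declustering: device = (sum of grid coordinates) mod M.
    Works in any number of dimensions."""
    return sum(coords) % num_devices

def devices_for_range(lo, hi, num_devices):
    """Devices hit by an axis-aligned range query over grid buckets (2D case)."""
    hit = set()
    for x in range(lo[0], hi[0] + 1):
        for y in range(lo[1], hi[1] + 1):
            hit.add(disk_modulo((x, y), num_devices))
    return hit

# A query covering a row of 4 buckets touches all 4 devices: fully parallel I/O.
print(devices_for_range((0, 0), (0, 3), num_devices=4))   # {0, 1, 2, 3}
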
XG: A Data-Driven Computation Grid for Enterprise-Scale Mining

In this paper we introduce a novel architecture for data processing, based on a functional fusion between a data and a computation layer. We show how such an architecture can be leveraged to offer significant speedups for data processing jobs such as data analysis and mining over large data sets.

One novel contribution of our solution is its data-driven approach: the computation infrastructure is controlled from within the data layer. Grid compute job submission events originate within the query processor on the DBMS side and are in effect controlled by the data processing job to be performed. This allows the early deployment of on-the-fly data aggregation techniques, minimizing the amount of data to be transferred to and from compute nodes, and is in stark contrast to existing Grid solutions that interact with data layers mainly as external "storage".

We validate this in a scenario derived from a real business deployment, involving financial customer profiling using common types of data analytics (e.g., linear regression analysis). Experimental results show significant speedups. For example, using a grid of only 12 non-dedicated nodes, we observed a speedup of approximately 1000% in a scenario involving complex linear regression analysis data mining computations for commercial customer profiling.

Radu Sion, Ramesh Natarajan, Inderpal Narang, Wen-Syan Li, Thomas Phan
A Rule System for Heterogeneous Spatial Reasoning in Geographic Information System

In this article, we investigate the hybrid spatial reasoning problem in geographic information systems, formulated as a hybrid formalism that combines two essential formalisms in qualitative spatial reasoning: the topological formalism and the cardinal direction formalism. Although much work has been done on developing composition tables for these formalisms, previous research on integrating heterogeneous formalisms has not been sufficient. Instead of using conventional composition tables, we investigate the interactions between topological and cardinal directional relations with the aid of rules of the kind used efficiently in many research fields, such as content-based image retrieval. These rules are shown to be sound, i.e. the deductions are logically correct. Based on these rules, an improved constraint propagation algorithm is introduced to enforce path consistency.

Haibin Sun, Wenhui Li
Querying a Polynomial Object-Relational Constraint Database in Model-Based Diagnosis

Many papers on Constraint Database (CDB) theory exist, including proposals that present frameworks for treating constraints as a new data type. Our proposal presents a new way of storing and manipulating constraints as ordinary data, and of posing queries over the constraint variables derived from an Object-Relational Constraint Database (ORCDB). In this work, the constraints stored in an ORCDB are restricted to polynomial equality constraints. The proposal is based on Gröbner bases, constraint consistency and constraint optimisation techniques. Most work on CDBs uses spatio-temporal data as a case study; this work, however, addresses an emerging engineering domain, that of fault diagnosis.

M. T. Gómez-López, R. M. Gasca, C. Del Valle, F. T. de la Rosa
A Three-Phase Knowledge Extraction Methodology Using Learning Classifier System

Machine learning methods such as fuzzy logic, neural networks and decision tree induction have been applied to learn rules, but they may become trapped in local optima. Based on the principles of natural evolution and global search, a genetic algorithm is promising for obtaining better results. This article adopts the learning classifier system (LCS) technique to provide a three-phase knowledge extraction methodology, which performs continuous and instant learning while integrating multiple rule sets into a centralized knowledge base. This paper makes three important contributions: (1) it represents rule sets derived from different sources and encodes them as fixed-length bit strings in the knowledge encoding phase; (2) it uses three criteria (accuracy, coverage, and fitness) to select an optimal set of rules from a large population in the knowledge extraction phase; (3) it applies genetic operations to generate optimal rule sets in the knowledge integration phase. The experiments show that the rule sets derived by the proposed approach are more accurate than those of other machine learning algorithms.

An-Pin Chen, Kuang-Ku Chen, Mu-Yen Chen
A Replica Allocation Method Adapting to Topology Changes in Ad Hoc Networks

In ad hoc networks, data accessibility decreases due to network divisions. To solve this problem, it is effective for each mobile host to create replicas of data items held by others. In this paper, we assume an environment where the mobility and access characteristics of mobile hosts exhibit locality, and we propose a method that locally relocates replicas just before a network division occurs.

Hideki Hayashi, Takahiro Hara, Shojiro Nishio
On a Collaborative Caching in a Peer-to-Peer Network for Push-Based Broadcast

In this paper, we propose a new collaborative caching strategy for a push-based broadcast environment where clients form a peer-to-peer network by connecting with each other. In the proposed strategy, a client takes into account its own access probabilities and information on queries issued by other clients, and caches the data items that yield the largest benefit in response time. Simulation experiments confirm that the proposed strategy reduces the average response time.

Kazuhiko Maeda, Wataru Uchida, Takahiro Hara, Shojiro Nishio
An Efficient Location Encoding Method Based on Hierarchical Administrative District

Due to the rapid development of mobile communication technologies, the use of mobile devices such as cell phones and PDAs is becoming increasingly popular. As different devices require different applications, various new services are being developed to satisfy these needs. One popular service in heavy demand is the Location-Based Service (LBS), which exploits the spatial information of moving objects as it changes over time. In order to support LBS efficiently, it is necessary to index and query large amounts of spatio-temporal information about moving objects. In this paper, we therefore investigate how such location information can be efficiently stored and indexed. In particular, we propose a novel location encoding method based on hierarchical administrative district information. Our proposal differs from conventional approaches, where moving objects are often expressed as geometric points (x, y) in two-dimensional space. Instead, in our approach, moving objects are encoded as one-dimensional points using both administrative district and road information. Our method is especially useful for monitoring traffic situations or tracing the location of moving objects through approximate spatial queries.

SangYoon Lee, Sanghyun Park, Woo-Cheol Kim, Dongwon Lee
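
The encoding idea, replacing an (x, y) point by a one-dimensional code built from the administrative hierarchy plus road information, can be sketched as follows; district names, code widths and the road table in this Python snippet are purely illustrative assumptions:

# Hypothetical fixed-width codes for each level of the administrative hierarchy.
PROVINCES = {"Seoul": 1, "Gyeonggi": 2}
DISTRICTS = {("Seoul", "Jongno"): 3, ("Seoul", "Gangnam"): 7}
ROADS = {("Seoul", "Jongno", "Sejong-daero"): 12}

def encode_location(province, district, road):
    """Concatenate level codes into one integer: province | district | road.

    Locations sharing an administrative prefix map to a contiguous code range,
    so 'all objects in a given province' becomes a one-dimensional range query.
    """
    p = PROVINCES[province]
    d = DISTRICTS[(province, district)]
    r = ROADS[(province, district, road)]
    return (p << 16) | (d << 8) | r    # 8 bits per lower level in this sketch

code = encode_location("Seoul", "Jongno", "Sejong-daero")
prefix = PROVINCES["Seoul"] << 16
print(code, prefix <= code < prefix + (1 << 16))   # code lies in Seoul's range
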
Personalized and Community Decision Support in eTourism Intermediaries

Rapidly growing web technologies and electronic commerce applications have stimulated the need for personalized and group decision support functionalities in eTourism intermediary systems. The goal of this paper is to propose a functional framework and design process for building a web-based, customer-oriented decision support system to facilitate personalized and community tourism services. Major decision support functions include personalized data and model management, information search and navigation, product/vendor evaluation and recommendation, do-it-yourself travel planning and design, community and collaboration management, auction and negotiation, as well as trip tracking and quality control. A system implementation approach and a system prototype are also presented and discussed.

Chien-Chih Yu
Reengineering the Knowledge Component of a Data Warehouse-Based Expert Diagnosis System

We describe the weaknesses of an existing expert diagnosis-recommendation system we have developed for SMEs. In good part, these weaknesses are related to the fact that the system was not implemented with appropriate artificial intelligence techniques. We recently decided to tackle the problem and re-engineered the core of the system with the help of an up-to-date expert system shell. In the process, we revised the formalization and reorganization of the system's expertise and developed a brand new knowledge base. We describe here the new system and the improvements made, and we identify ongoing and future developments.

Jean-François Beaudoin, Sylvain Delisle, Mathieu Dugré, Josée St-Pierre
A Model-Based Monitoring and Diagnosis System for a Space-Based Astrometry Mission

Space-based astrometry missions can achieve an accuracy that has not been possible before with ground-based observations. However, in order to guarantee this precision, it is of the utmost importance to check the scientific quality of the data constantly. We present a model-based monitoring system, called Science Quick Look, that is able to carry out this preliminary scientific assessment. We have implemented a prototype of the system and show the results of an evaluation.

Aleksei Pavlov, Sven Helmer, Guido Moerkotte
An Effective Method for Locally Neighborhood Graphs Updating

Neighborhood graphs are an effective and widespread technique in several fields. However, despite their usefulness, their construction algorithms suffer from very high complexity, which prevents their use in applications processing large data volumes. This high complexity also affects the update task. These structures actually constitute a possible representation of the point location problem in a multidimensional space. Point location on an axis can be solved by a binary search, and the same problem in the plane can be solved using a Voronoi diagram; but when the dimension becomes higher, location becomes more complex and difficult to manage. In this paper we propose an effective method for point location in a multidimensional space, with the aim of updating neighborhood graphs effectively and quickly.

Hakim Hacid, Abdelkader Djamel Zighed
Efficient Searching in Large Inheritance Hierarchies

Inheritance hierarchies have become more and more complex as object-oriented technology has expanded. One of the main problems is effective searching in such hierarchies, for which more sophisticated algorithms are needed. In this article we present a novel approach to efficient searching in large inheritance hierarchies. The approach, which supports updates, employs multi-dimensional data structures to index inheritance hierarchies and search the data effectively.

Michal Krátký, Svatopluk Štolfa, Václav Snášel, Ivo Vondrák
Backmatter
Metadata
Title
Database and Expert Systems Applications
Edited by
Kim Viborg Andersen
John Debenham
Roland Wagner
Copyright year
2005
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-31729-6
Print ISBN
978-3-540-28566-3
DOI
https://doi.org/10.1007/11546924
