2008 | Book

Database and Expert Systems Applications

19th International Conference, DEXA 2008, Turin, Italy, September 1-5, 2008. Proceedings

Editors: Sourav S. Bhowmick, Josef Küng, Roland Wagner

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science

About this book

This book constitutes the refereed proceedings of the 19th International Conference on Database and Expert Systems Applications, DEXA 2008, held in Turin, Italy, in September 2008. The 74 revised full papers presented together with 1 invited paper were carefully reviewed and selected from 208 submissions. The papers are organized in topical sections on data privacy; temporal, spatial and high dimensional databases; semantic Web and ontologies; query processing; Web and information retrieval; mobile data and information; data and information streams; data mining algorithms; multimedia databases; data mining systems, data warehousing, OLAP; data and information semantics; XML databases; applications of database, information, and decision support systems; and schema, process and knowledge modelling and evolution.

Table of Contents

Frontmatter

Invited Talk

Towards Engineering Purposeful Systems: A Requirements Engineering Perspective

A number of studies [1][2][3] show that systems fail due to an inadequate or insufficient understanding of the requirements they seek to address. Further, the amount of effort needed to fix these systems has been found to be very high [4]. To correct this situation, it is necessary to address the issue of requirements elicitation, validation, and specification in a relatively more focused manner. The expectation is that as a result of this, more acceptable systems will be developed in the future. The field of requirements engineering has emerged to meet this expectation. Requirements Engineering extends the ‘what is done by the system’ approach with the ‘why is the system like this’ view. This ‘why’ question is answered in terms of organizational objectives and their impact on information systems supporting the organization. In other words, information systems are seen as fulfilling a certain purpose in an organization and requirements engineering helps in the conceptualization of these purposeful systems.

The talk will focus on the above issue of conceptualising purposeful systems.

It will argue that the goal concept is central to resolving the issue of conceptualising purposeful systems and then elaborate on goal modeling and reasoning with goals in order to demonstrate the various roles of goals in conceptualizing purposeful systems. Goal modeling has proved to be an effective way to elicit requirements [7][8][9][10][11][12][13]. The assumption of goal-based requirements elicitation is that the rationale for developing a system is found outside the system itself, in the enterprise [14] in which the system shall function. In this movement from the ‘whats’ to the ‘whys’, it becomes mandatory to consider the multiple viewpoints of the various stakeholders, to explore alternative design choices and to reason about them so as to make conceptual decisions on the basis of rationale arguments in favour of and against the different alternatives. Alternative goal refinement has proved helpful in the systematic exploration of system choices [8][14][15][16]. Recording these alternatives also helps to deal with changing requirements. Goals provide a means to ensure requirements pre-traceability [5][17][18]. They establish a conceptual link between the system and its environment, thus facilitating the propagation of organizational changes into the system functionality. This link provides the rationale for the system functionality [6][8][19][20][21] and facilitates its explanation and justification to the stakeholders. Stakeholders provide useful and realistic viewpoints about the system To-Be, expressed as goals. Negotiation techniques have been developed to help choose the prevalent one [22][23]. Prioritization techniques aim at providing means to compare the different viewpoints on the basis of costs and value [24][25]. Multiple viewpoints are inherently associated with conflicts [26], and goals have been recognized to help in the detection of conflicts and their resolution [27][28][29][30].

In the rest of this talk, we will consider new challenges raised by emerging conditions of system development leading to variability in functionality modelling and customisation in the engineering process. Variability is imposed by the multi-purpose nature of information systems of today. We will use a particular goal model called goal/strategy map to illustrate how a goal model can make variability explicit and support goal-based reasoning to help in selecting the right variant for the project at hand. Our position is that variability implies a move from systems with a mono-facetted purpose to those with a multi-facetted purpose. Whereas the former concentrates on goal discovery, the multi-facetted nature of a purpose extends it to consider the many different ways of goal achievement. For example, for the goal Purchase Material, earlier it would be enough to know that an organization achieves this goal by forecasting material need. Thus, Purchase material was mono-facetted: it had exactly one strategy for its achievement. However, in the new context, it is necessary to introduce other strategies as well, say the Reorder Point strategy for purchasing material. Purchase Material now is multi-facetted; it has many strategies for goal achievement. These two strategies, among others, are made available, for example, in the SAP Materials Management module [31].

The foregoing points to the need to balance goal-orientation with the introduction of strategies for goal achievement. This is the essence of goal/strategy maps. A goal/strategy map, or map for short, is a graph with intentions as nodes and strategies as edges. An edge entering a node identifies a strategy that can be used for achieving the intention of the node. The map therefore shows which intentions can be achieved by which strategies once a preceding intention has been achieved. Evidently, the map is capable of expressing goals and their achievement in a declarative manner. The talk will introduce the concept of a map [32], illustrate it with an ERP system example and discuss how the model meets the aforementioned challenges.
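
As a concrete illustration of the structure described above, the following minimal Python sketch treats a map as intention nodes connected by named strategy edges; the Purchase Material intention and its two strategies come from the abstract, while the class and method names are purely illustrative assumptions, not part of the published model.

```python
from collections import defaultdict

class GoalStrategyMap:
    """Minimal sketch of a goal/strategy map: intentions are nodes,
    strategies are directed, labelled edges between intentions."""

    def __init__(self):
        # (source intention, target intention) -> set of strategy names
        self.strategies = defaultdict(set)

    def add_strategy(self, source, target, strategy):
        """Register one way (strategy) of achieving `target` once `source` holds."""
        self.strategies[(source, target)].add(strategy)

    def variants(self, source, target):
        """All known strategies for achieving `target` from `source`
        (a multi-facetted intention has more than one)."""
        return sorted(self.strategies[(source, target)])

# The Purchase Material example from the talk: one intention, two strategies.
m = GoalStrategyMap()
m.add_strategy("Start", "Purchase Material", "Forecast-based strategy")
m.add_strategy("Start", "Purchase Material", "Reorder Point strategy")
print(m.variants("Start", "Purchase Material"))
# ['Forecast-based strategy', 'Reorder Point strategy']
```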

Colette Rolland

Session 1

Data Privacy

Hiding Frequent Patterns under Multiple Sensitive Thresholds

Frequent pattern mining is a popular topic in data mining. With the advance of this technique, privacy issues have attracted more and more attention in recent years. In this field, previous works base the hiding of sensitive information on a uniform support threshold or a disclosure threshold. However, in practical applications, we probably need to apply different support thresholds to different itemsets to reflect their significance. In this paper, we propose a new hiding strategy to protect sensitive frequent patterns with multiple sensitive thresholds. Based on different sensitive thresholds, the sanitized dataset is able to fulfill user requirements in real applications to a high degree, while preserving more information of the original dataset. Empirical studies show that our approach can protect sensitive knowledge well not only under multiple thresholds, but also under a uniform threshold. Moreover, the quality of the sanitized dataset can be maintained.

Ya-Ping Kuo, Pai-Yu Lin, Bi-Ru Dai
BSGI: An Effective Algorithm towards Stronger l-Diversity

To reduce the risk of privacy disclosure during personal data publishing, the approach of anonymization is widely employed. On this topic, current studies mainly focus on two directions: (1) developing privacy preserving models which satisfy certain constraints, such as k-anonymity, l-diversity, etc.; (2) designing algorithms for a certain privacy preserving model to achieve better privacy protection as well as less information loss. This paper generally belongs to the second class. We introduce an effective algorithm, “BSGI”, for the widely accepted privacy preserving model l-diversity. In the meantime, we propose a novel interpretation of l-diversity: Unique Distinct l-diversity, which can be properly achieved by BSGI. We substantiate that it is a stronger l-diversity model than other interpretations. Related to the algorithm, we conduct the first research on the optimal assignment of the parameter l according to a given dataset. Extensive experimental evaluation shows that Unique Distinct l-diversity provides much better protection than conventional l-diversity models, and BSGI greatly outperforms the state of the art in terms of both efficiency and data quality.

Yang Ye, Qiao Deng, Chi Wang, Dapeng Lv, Yu Liu, Jianhua Feng

Temporal, Spatial, and High Dimensional Databases I

The Truncated Tornado in TMBB: A Spatiotemporal Uncertainty Model for Moving Objects

The uncertainty management problem is one of the key issues associated with moving objects (MOs). Minimizing the uncertainty region size can increase both query accuracy and system performance. In this paper, we propose an uncertainty model called the Truncated Tornado model as a significant advance in minimizing uncertainty region sizes. The Truncated Tornado model removes uncertainty region sub-areas that are unreachable due to the maximum velocity and acceleration of the MOs. To make indexing of the uncertainty regions more tractable we utilize an approximation technique called Tilted Minimum Bounding Box (TMBB) approximation. Through experimental evaluations we show that Truncated Tornado in TMBB results in orders of magnitude reduction in volume compared to a recently proposed model called the Tornado model and to the standard “Cone” model when approximated by axis-parallel MBBs.

Shayma Alkobaisi, Petr Vojtěchovský, Wan D. Bae, Seon Ho Kim, Scott T. Leutenegger
Reordering of Location Identifiers for Indexing an RFID Tag Object Database

The query performance for tracing tags depends upon the distribution of tag trajectories in the data space. We examine a more efficient representation of tag trajectories by means of ordering the set of domain values. Our analysis shows that the order of Location IDentifiers (LIDs) can have a great impact on the efficiency of query processing. To find the optimal ordering of LIDs, we propose a new LID proximity function to rearrange an arbitrary order of LIDs. This function enables logically adjacent tag trajectories, which are frequently accessed together, to be stored in close proximity on the disk. To determine the optimal sequence of LIDs in the domain, we also propose a reordering scheme of LIDs. The experimental results show that the proposed reordering scheme achieves better query processing than the previous methods of assigning LIDs.

Sungwoo Ahn, Bonghee Hong
A Free Terrain Model for Trajectory K–Anonymity

This paper introduces a privacy model for location based services that utilizes collected movement data to identify parts of the user trajectories where user privacy is at an elevated risk. To protect the privacy of the user, the proposed methodology transforms the original requests into anonymous counterparts by offering trajectory K-anonymity. As a proof of concept, we build a working prototype that implements our solution approach and is used for experimentation and evaluation purposes. Our implementation relies on a spatial DBMS that carries out part of the necessary analysis. Through experiments we demonstrate the effectiveness of our approach to preserve the K-anonymity of the users for as long as the requested services are in progress.

Aris Gkoulalas-Divanis, Vassilios S. Verykios
HRG: A Graph Structure for Fast Similarity Search in Metric Spaces

Indexing is the most effective technique to speed up queries in databases. While traditional indexing approaches are used for exact search, a query object may not always be identical to an existing data object in similarity search. This paper proposes a new dynamic data structure called the Hyperspherical Region Graph (HRG) to efficiently index a large volume of data objects as a graph for similarity search in metric spaces. HRG encodes the given dataset in a smaller number of vertices than the known graph index, Incremental-RNG, while providing flexible traversal without incurring the backtracking observed in tree-based indices. An empirical analysis of search time shows that HRG outperforms Incremental-RNG in both cases. HRG, however, outperforms tree-based indices in range search only when the data dimensionality is not too high.

Omar U. Florez, SeungJin Lim

Session 2A: Semantic Web and Ontologies

Word Sense Disambiguation as the Primary Step of Ontology Integration

The recommendable primary step of ontology integration is annotation of ontology components with entries from WordNet or other dictionary sources in order to disambiguate their meaning. This paper presents an approach to automatically disambiguating the meaning of OWL ontology classes by providing sense annotation from WordNet. A class name is disambiguated using the names of the related classes, by comparing the taxonomy of the ontology with the portions of the WordNet taxonomy corresponding to all possible meanings of the class. The equivalence of the taxonomies is expressed by a probability function called affinity function. We apply two different basic techniques to compute the affinity coefficients: one based on semantic similarity calculation and the other on analyzing overlaps between word definitions and hyponyms. A software prototype is provided to evaluate the approach, as well as to determine which of the two disambiguation techniques produces better results.

Marko Banek, Boris Vrdoljak, A Min Tjoa
Enriching Ontology for Deep Web Search

This paper addresses the problems of extracting instances from the Deep Web, enriching a domain specific ontology with those instances, and using this ontology to improve Web search. Extending an existing ontology with a large number of instances extracted from the Deep Web is an important process for making the ontology more usable for indexing of Deep Web sites. We demonstrate how instances extracted from the Deep Web are used to enhance a domain ontology. We show the contribution of the enriched ontology to Web search effectiveness. This is done by comparing the number of relevant Web sites returned by a search engine with a user’s search terms only, with the Web sites found when using additional ontology-based search terms. Experiments suggest that the ontology plus instances approach results in more relevant Web sites among the first 100 hits.

Yoo Jung An, Soon Ae Chun, Kuo-chuan Huang, James Geller
POEM: An Ontology Manager Based on Existence Constraints

Whereas building and enriching ontologies are complex and tedious tasks, only a few tools propose a complete solution to assist users. This paper presents POEM (Platform for Ontology crEation and Management), which provides a complete environment for ontology building, merging and evolution. POEM’s algorithms are based on existence constraints and Boolean equations that allow the semantics of concepts to be captured. POEM also integrates a semantic fusion that relies on WordNet and supports different storage formats. The different functionalities are implemented as a set of web services.

Nadira Lammari, Cédric du Mouza, Elisabeth Métais

Session 2B: Query Processing

Extending Inconsistency-Tolerant Integrity Checking by Semantic Query Optimization

The premise that all integrity constraints be totally satisfied guarantees the correctness of methods for simplified integrity checking, and of semantic query optimization. This premise can be waived for many integrity checking methods. We study how it can be relaxed also for applying semantic query optimization to an inconsistency-tolerant evaluation of simplified constraints.

Hendrik Decker
On the Evaluation of Large and Sparse Graph Reachability Queries

Given a directed graph G with n nodes and e edges, it is often required to check whether a node v is reachable from another node u through a path. Such a problem is fundamental to numerous applications, including geographic navigation, Internet routing, ontology queries based on RDF/OWL, and metabolic networks, as well as XML indexing. Among them, some involve huge but sparse graphs and require fast answering of reachability queries. In this paper, we propose a novel method called core labeling to handle reachability queries for massive, sparse graphs. The goal is to optimize both query time and labeling time. Our method consists of two schemes: Core-I and Core-II. For the Core-I labeling scheme, both the time and space requirements are bounded by O(n + e + s·b), where s is the number of the start nodes of all non-tree edges (edges that do not appear in the spanning tree of G) and b is the width of a subgraph of G. The size of that subgraph is bounded by O(t), where t is the number of all the non-tree edges. The query time of Core-I is bounded by O(log b). The Core-II labeling scheme has constant query time, but the labeling time is increased to O(n + e + s·b·log b).
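
For contrast with the labeling schemes summarized above, the sketch below shows only the naive per-query reachability test that such schemes are designed to avoid; the adjacency-list representation and function name are illustrative assumptions, and nothing here reproduces the Core-I/Core-II labeling itself.

```python
def reachable(adj, u, v):
    """Naive reachability check: is v reachable from u in the directed
    graph given as an adjacency-list dict? This costs O(n + e) per query,
    which is the per-query cost that labeling schemes aim to avoid."""
    stack, seen = [u], {u}
    while stack:
        node = stack.pop()
        if node == v:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

adj = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
print(reachable(adj, "a", "d"))  # True
print(reachable(adj, "c", "d"))  # False
```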

Yangjun Chen
SQL TVF Controlling Forms - Express Structured Parallel Data Intensive Computing

A key issue in supporting the synthesis of data intensive computation and data management is to liberate users from low-level parallel programming, by specifying applications functionally, independent of the underlying server infrastructure, and further, by providing high-level primitives to express the control flow of applying functions to data partitions. Currently only a few such primitives, e.g. Map-Reduce and Cross-Apply, are available, and their expressive power is limited to “flat parallel computing”. To deal with “structured parallel computing”, where a function is applied to multiple objects with execution order dependencies, a general framework for creating and combining such primitives is required.

We propose the SQL-FCF framework as the database centric solution to the above problem. We embed into SQL queries the Function Controlling Forms (FCFs) to specify the flow control of applying Table Valued Functions (TVFs) to multiple data partitions. We further support the extensibility of this framework by allowing new FCFs to be defined from existing ones with SQL phrases. Based on this approach, we provided a SQL based high-level interface for “structured parallel computing” in architecting a hydrologic scientific computation platform. Envisioning that the simple parallel computing primitives will evolve and form a general framework, our effort is a step towards that goal.

Qiming Chen, Meichun Hsu
A Decidable Fuzzy Description Logic F-ALC(G)

The existing Web Ontology languages and the corresponding description logics do not support customized data type information. Furthermore, they cannot process imprecise and uncertain information in the Semantic Web. In order to overcome these limitations, we propose a new kind of fuzzy description logic, F-ALC(G), which can not only support the representation and reasoning of fuzzy concept knowledge, but also support fuzzy data information with customized fuzzy data types and customized fuzzy data type predicates.

Hailong Wang, Z. M. Ma

Session 3: Web and Information Retrieval

Ranking Entities Using Comparative Relations

This paper proposes a method for ranking entities (e.g. products, people, etc.) that uses the comparative sentences described in text such as reviews, blogs, etc. as an indicator of an individual entity’s value. A comparative sentence expresses a relation between two entities. The comparative sentence “The quality of A is better than that of B” is expressed by the comparative relation {A, B, quality, better}. Given a query (set of queries), the proposed method automatically finds the competitive entities and extracts the comparative relations among them. From the vast amount of comparative relations so extracted, the proposed method then generates a graph modeling the behavior of a “potential customer” to assess the relative importance of every entity against its competitors. These ranking results help potential customers to know the position of the entity among related entities and decide which one to choose.

Takeshi Kurashima, Katsuji Bessho, Hiroyuki Toda, Toshio Uchiyama, Ryoji Kataoka
Query Recommendation Using Large-Scale Web Access Logs and Web Page Archive

Query recommendation suggests related queries for search engine users when they are not satisfied with the results of an initial input query, thus assisting users in improving search quality. Conventional approaches to query recommendation have been focused on expanding a query by terms extracted from various information sources such as a thesaurus like WordNet, the top ranked documents and so on. In this paper, we argue that past queries stored in query logs can be a source of additional evidence to help future users. We present a query recommendation system based on large-scale Web access logs and Web page archive, and evaluate three query recommendation strategies based on different feature spaces (i.e., noun, URL, and Web community). The experimental results show that query logs are an effective source for query recommendation, and the Web community-based and noun-based strategies can extract more related search queries than the URL-based one.

Lin Li, Shingo Otsuka, Masaru Kitsuregawa
Description Logic to Model a Domain Specific Information Retrieval System

In professional environments which are characterized by a domain (Medicine, Law, etc.), information retrieval systems must be able to process precise queries, mostly because of the use of a specific domain terminology, but also because the retrieved information is meant to be part of the professional task (a diagnosis, writing a law text, etc.). In this paper we address the problem of solving domain-specific precise queries. We present an information retrieval model based on description logics to represent external knowledge resources and provide expressive document indexing and querying.

Saïd Radhouani, Gilles Falquet, Jean-Pierre Chevallet
Extending the Edit Distance Using Frequencies of Common Characters

Similarity search of time series has attracted many researchers recently. In this scope, reducing the dimensionality of data is required to scale up the similarity search. Symbolic representation is a promising technique of dimensionality reduction, since it allows researchers to benefit from the richness of algorithms used for textual databases. To improve the effectiveness of similarity search, we propose in this paper an extension to the edit distance that we call the extended edit distance. This new distance is applied to symbolic sequential data objects, and we test it on time series databases in classification task experiments. We also prove that our distance is a metric.
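
The abstract does not spell out the extension itself, so the sketch below shows only the classical edit distance that the proposed extended edit distance builds on, as a hedged baseline; the function name and example strings are illustrative.

```python
def edit_distance(s, t):
    """Classical (Levenshtein) edit distance between two symbolic
    sequences, the measure that the paper's extended edit distance
    builds on; the extension itself is not reproduced here."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[m][n]

print(edit_distance("abcab", "acb"))  # 2
```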

Muhammad Marwan Muhammad Fuad, Pierre-François Marteau

Session 4: Mobile Data and Information

Tracking Moving Objects in Anonymized Trajectories

Multiple target tracking (MTT) is a well-studied technique in the field of radar technology, which associates anonymized measurements with the appropriate object trajectories. This technique, however, suffers from combinatorial explosion, since each new measurement may potentially be associated with any of the existing tracks. Consequently, the complexity of existing MTT algorithms grows exponentially with the number of objects, rendering them inapplicable to large databases. In this paper, we investigate the feasibility of applying the MTT framework in the context of large trajectory databases. Given a history of object movements, where the corresponding object ids have been removed, our goal is to track the trajectory of every object in the database in successive timestamps. Our main contribution lies in the transition from an exponential solution to a polynomial one. We introduce a novel method that transforms the tracking problem into a min-cost max-flow problem. We then utilize well-known graph algorithms that work in polynomial time with respect to the number of objects. The experimental results indicate that the proposed methods produce high quality results that are comparable with the state-of-the-art MTT algorithms. In addition, our methods reduce significantly the computational cost and scale to a large number of objects.
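
To make the graph-based formulation above concrete, the following sketch reduces a single timestamp transition to a minimum-cost bipartite assignment using SciPy's Hungarian solver; this is a simplified stand-in for the min-cost max-flow construction mentioned in the abstract, not the authors' algorithm, and the point coordinates and function name are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_one_step(prev_pts, next_pts):
    """Associate anonymized points at timestamp t+1 with those at timestamp t
    by minimising total displacement (a one-step, assignment-based stand-in
    for the flow-based formulation described in the abstract)."""
    prev_pts = np.asarray(prev_pts, dtype=float)
    next_pts = np.asarray(next_pts, dtype=float)
    # cost[i, j] = Euclidean distance between previous point i and next point j
    cost = np.linalg.norm(prev_pts[:, None, :] - next_pts[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))

prev_pts = [(0.0, 0.0), (5.0, 5.0)]
next_pts = [(5.2, 4.9), (0.1, 0.2)]
print(match_one_step(prev_pts, next_pts))  # [(0, 1), (1, 0)]
```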

Nikolay Vyahhi, Spiridon Bakiras, Panos Kalnis, Gabriel Ghinita
REALM: Replication of Data for a Logical Group Based MANET Database

Mobile Ad-Hoc Networks, or MANETs, provide communication between free-roaming mobile hosts without a fixed infrastructure. These MANETs operate under conditions of limited battery power and may also experience frequent network partitioning. This paper introduces a data replication technique called REALM (REplication of data for A Logical group based MANET database). REALM takes advantage of application semantics when determining where to place replicas of data items. It also tries to predict network partitioning and disconnection ahead of time and creates copies of data items accordingly. Experiments show that REALM increases the percentage of successful transactions with or without the presence of network disconnection and partitioning.

Anita Vallur, Le Gruenwald, Nick Hunter
A Cache Management Method for the Mobile Music Delivery System: JAMS

The authors have proposed a music delivery system, JAMS (JAMais vu System), which uses ad-hoc networks of mobile devices for localization services. In this paper, they focus on a cache management scheme for JAMS to deliver music files efficiently. The proposed scheme makes it possible to manage cache files robustly by using adaptive cache delivery according to each node's requests. In addition, the authors conduct a simulation experiment and obtain the access failure ratio of music files under various system conditions. The simulation results indicate that the proposed scheme is suitable for systems using localization services.

Hiroaki Shibata, Satoshi Tomisawa, Hiroki Endo, Yuka Kato
EcoRare: An Economic Incentive Scheme for Efficient Rare Data Accessibility in Mobile-P2P Networks

We propose EcoRare, a novel economic incentive scheme for improving the availability of rare data in Mobile-P2P networks. EcoRare combats free-riding and effectively involves free-riders to improve the availability (and lifetimes) of rare data. EcoRare also facilitates the creation of multiple copies of rare items in the network, since its novel selling mechanism allows a given data item to have multiple owners. Our performance study demonstrates that EcoRare indeed improves query response times and the availability of rare data items in Mobile-P2P networks.

Anirban Mondal, Sanjay Kumar Madria, Masaru Kitsuregawa

Session 5: Data and Information Streams

Identifying Similar Subsequences in Data Streams

Similarity search has been studied in the domain of time series data mining, and it is an important technique in stream mining. Since the sampling rates of streams frequently differ and their time periods vary in practical situations, methods that deal with time warping, such as Dynamic Time Warping (DTW), are suitable for measuring similarity. However, finding pairs of similar subsequences between co-evolving sequences is difficult due to the increase in complexity, because DTW is a method for detecting sequences that are similar to a given query sequence.

In this paper, we focus on the problem of finding pairs of similar subsequences and periodicity over data streams. We propose a method to detect similar subsequences in a streaming fashion. Our approach for measuring similarity relies on a proposed scoring function that incrementally updates a score, which is suitable for data stream processing. We also present an efficient algorithm based on the scoring function. Our experiments on real and synthetic data demonstrate that our method detects the pairs of qualifying subsequences correctly and that it is dramatically faster than the existing method.
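
The incremental scoring function itself is not given in the abstract, so the sketch below shows only the standard batch DTW distance on which such subsequence matching is based; the sequence values and the function name are illustrative.

```python
def dtw(x, y):
    """Standard dynamic time warping distance between two numeric
    sequences. The paper replaces this batch computation with an
    incremental scoring function; only the classical measure is shown."""
    n, m = len(x), len(y)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            d[i][j] = cost + min(d[i - 1][j],       # insertion
                                 d[i][j - 1],       # deletion
                                 d[i - 1][j - 1])   # match
    return d[n][m]

print(dtw([1, 2, 3, 4], [1, 2, 2, 3, 4]))  # 0.0 (warping absorbs the repeat)
```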

Machiko Toyoda, Yasushi Sakurai, Toshikazu Ichikawa
A Tree-Based Approach for Event Prediction Using Episode Rules over Event Streams

Event prediction over event streams is an important problem with broad applications. For this problem, rules with predicate events and consequent events are given, and current events are then matched with the predicate events to predict future events. Over the event stream, some matches of predicate events may trigger duplicate predictions, and an effective scheme is proposed to avoid such redundancies. Based on the scheme, we propose a novel approach, CBS-Tree, to efficiently match the predicate events over event streams. The CBS-Tree approach maintains the recently arrived events in a tree structure, and an efficient algorithm is proposed for matching predicate events on the tree structure, which avoids exhaustive scans of the arrived events. By running a series of experiments, we show that our approach is more efficient than the previous work in most cases.

Chung-Wen Cho, Ying Zheng, Yi-Hung Wu, Arbee L. P. Chen
Effective Skyline Cardinality Estimation on Data Streams

In order to incorporate the skyline operator into the data stream engine, we need to address the problem of skyline cardinality estimation, which is very important for extending the query optimizer’s cost model to accommodate skyline queries. In this paper, we propose robust approaches for estimating the skyline cardinality over sliding windows in the stream environment. We first design an approach to estimate the skyline cardinality over uniformly distributed data, and then extend the approach to support arbitrarily distributed data. Our approaches allow arbitrary data distributions and hence can be applied to extend the optimizer’s cost model. To estimate the skyline cardinality in an online manner, the live elements in the sliding window are sketched using Spectral Bloom Filters, which can efficiently and effectively capture the information that is essential for estimating the skyline cardinality over sliding windows. An extensive experimental study demonstrates that our approaches significantly outperform previous approaches.
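
As a rough illustration of the sketching step mentioned above, the following minimal counting-filter class mimics the insert/remove/estimate behaviour of a Spectral Bloom Filter over a sliding window; the sizes, hash construction and class name are assumptions, and the code is a simplification rather than the structure used in the paper.

```python
import hashlib

class SpectralBloomFilter:
    """Minimal counting Bloom filter in the spirit of a Spectral Bloom
    Filter: counters instead of bits, so elements leaving the sliding
    window can be removed and approximate frequencies queried."""

    def __init__(self, size=1024, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def insert(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item):
        for pos in self._positions(item):
            self.counters[pos] -= 1

    def estimate(self, item):
        # Minimum-counter estimate of how often `item` is currently stored.
        return min(self.counters[pos] for pos in self._positions(item))

sbf = SpectralBloomFilter()
for x in ["a", "a", "b"]:
    sbf.insert(x)
sbf.remove("b")
print(sbf.estimate("a"), sbf.estimate("b"))  # 2 0 (barring hash collisions)
```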

Yang Lu, Jiakui Zhao, Lijun Chen, Bin Cui, Dongqing Yang

Session 6: Data Mining Algorithms

Detecting Current Outliers: Continuous Outlier Detection over Time-Series Data Streams

The development of sensor devices and ubiquitous computing has increased the prevalence of time-series data streams. With data streams, current data arrives continuously and must be monitored. This paper presents outlier detection over data streams by continuous monitoring. Outlier detection is an important data mining issue; it discovers outliers, which have features that differ profoundly from other objects or values. Most existing outlier detection techniques, however, deal with static data and are computationally expensive. Specifically, for outlier detection over data streams, real-time response is very important. Existing techniques for static data, however, involve many meaningless processes over data streams, and their calculation cost is too high. This paper introduces a technique that provides effective outlier detection over data streams using differential processing, and confirms its effectiveness.

Kozue Ishida, Hiroyuki Kitagawa
Component Selection to Optimize Distance Function Learning in Complex Scientific Data Sets

Analyzing complex scientific data, e.g., graphs and images, often requires comparison of features: regions on graphs, visual aspects of images and related metadata, some features being relatively more important. The notion of similarity for comparison is typically distance between data objects which could be expressed as distance between features. We refer to distance based on each feature as a component. Weights of components representing relative importance of features could be learned using distance function learning algorithms. However, it is seldom known which components optimize learning, given criteria such as accuracy, efficiency and simplicity. This is the problem we address. We propose and theoretically compare four component selection approaches: Maximal Path Traversal, Minimal Path Traversal, Maximal Path Traversal with Pruning and Minimal Path Traversal with Pruning. Experimental evaluation is conducted using real data from Materials Science, Nanotechnology and Bioinformatics. A trademarked software tool is developed as a highlight of this work.

Aparna Varde, Stephen Bique, Elke Rundensteiner, David Brown, Jianyu Liang, Richard Sisson, Ehsan Sheybani, Brian Sayre
Emerging Pattern Based Classification in Relational Data Mining

The usage of descriptive data mining methods for predictive purposes is a recent trend in data mining research. It is well motivated by the understandability of learned models, the limitation of the so-called “horizon effect” and by the fact that it is a multi-task solution. In particular, associative classification, whose main idea is to exploit association rule discovery approaches in classification, has gathered a lot of attention in recent years. A similar idea is represented by the use of emerging pattern discovery for classification purposes. Emerging Patterns are classes of regularities whose support significantly changes from one class to another, and the main idea is to exploit the class characterization provided by discovered emerging patterns for class labeling. In this paper we propose and compare two distinct emerging pattern based classification approaches that work in the relational setting. Experiments empirically prove the effectiveness of both approaches and confirm the advantage with respect to associative classification.

Michelangelo Ceci, Annalisa Appice, Donato Malerba

Session 7: Multimedia Databases

Rosso Tiziano: A System for User-Centered Exploration and Discovery in Large Image Information Bases

Retrieval in image information bases has been traditionally addressed by two different and unreconciled approaches: the first one uses normal query methods on metadata or on a textual description of each item. The second one works on low-level multimedia features (such as color, texture, etc.) and tries to find items that are similar to a specific selected item. Neither of these approaches supports the most common end-user task: the exploration of an information base in order to find the “right” items. This paper describes a prototype system based on dynamic taxonomies, a model for the intelligent exploration of heterogeneous information bases, and shows how the system implements a new access paradigm supporting guided exploration, discovery, and the seamless integration of access through metadata with methods based on low-level multimedia features. Example interactions are discussed, as well as the major implications of this approach.

Giovanni Maria Sacco
NM-Tree: Flexible Approximate Similarity Search in Metric and Non-metric Spaces

So far, an efficient similarity search in multimedia databases has been carried out by metric access methods (MAMs), where the utilized similarity measure had to satisfy the metric properties (reflexivity, non-negativity, symmetry, triangle inequality). Recently, the introduction of TriGen algorithm (turning any nonmetric into metric) enabled MAMs to perform also nonmetric similarity search. Moreover, it simultaneously enabled faster approximate search (either metric or nonmetric). However, a simple application of TriGen as the first step before MAMs’ indexing assumes a fixed “approximation level”, that is, a user-defined tolerance of retrieval precision is preset for the whole index lifetime. In this paper, we push the similarity search forward; we propose the NM-tree (nonmetric tree) – a modification of M-tree which natively aggregates the TriGen algorithm to support flexible approximate nonmetric or metric search. Specifically, at query time the NM-tree provides a user-defined level of retrieval efficiency/precision trade-off. We show the NM-tree could be used for general (non)metric search, while the desired retrieval precision can be flexibly tuned on-demand.

Tomáš Skopal, Jakub Lokoč
Efficient Processing of Nearest Neighbor Queries in Parallel Multimedia Databases

This paper deals with the performance problem of nearest neighbor queries in voluminous multimedia databases. We propose a data allocation method which allows achieving an $O(\sqrt{n})$ query processing time in parallel settings. Our proposal is based on the complexity analysis of content based retrieval when a clustering method is used. We derive a valid range of values for the number of clusters that should be obtained from the database. Then, to efficiently process nearest neighbor queries, we derive the optimal number of nodes to maximize parallel resources. We validated our method through experiments with different high dimensional databases and implemented a query processing algorithm for full k nearest neighbors in a shared nothing cluster.

Jorge Manjarrez-Sanchez, José Martinez, Patrick Valduriez

Session 8: Data Mining Systems, Data Warehousing, OLAP

OLAP for Trajectories

In this paper, we present an OLAP framework for trajectories of moving objects. We introduce a new operator GROUP_TRAJECTORIES for group-by operations on trajectories and present three implementation alternatives for computing groups of trajectories for group-by aggregation: group by overlap, group by intersection, and group by overlap and intersection. We also present an interactive OLAP environment for resolution drill-down/roll-up on sets of trajectories and parameter browsing. Using generated and real life moving data sets, we evaluate the performance of our GROUP_TRAJECTORIES operator. An implementation of our new interactive OLAP environment for trajectories can be accessed at http://OLAP-T.cgmlab.org.

Oliver Baltzer, Frank Dehne, Susanne Hambrusch, Andrew Rau-Chaplin
A Probabilistic Approach for Computing Approximate Iceberg Cubes

An iceberg cube is a refinement of a data cube containing the subset of cells whose measure is larger than a given threshold (iceberg condition). Iceberg cubes are well-established tools supporting fast data analysis, as they filter the information contained in classical data cubes to provide the most relevant pieces of information. Although the problem of efficiently computing iceberg cubes has been widely investigated, this task is intrinsically expensive, due to the large amount of data which must be usually dealt with. Indeed, in several application scenarios, efficiency is so crucial that users would benefit from a fast computation of even incomplete iceberg cubes. In fact, an incomplete iceberg cube could support preliminary data analysis by allowing users to focus their explorations quickly and effectively, thus saving large amounts of computational resources. In this paper, we propose a technique for efficiently computing iceberg cubes, possibly trading off the computational efficiency with the completeness of the result. Specifically, we devise an algorithm which employs a probabilistic framework to prevent cells which are probably irrelevant (i.e., which are unlikely to satisfy the iceberg condition) from being computed. The output of our algorithm is an incomplete iceberg cube, which is efficiently computed and prone to be refined, in the sense that the user can decide to go through the computation of the cells which were estimated irrelevant during the previous invocations of the algorithm.

Alfredo Cuzzocrea, Filippo Furfaro, Giuseppe M. Mazzeo
Noise Control Boundary Image Matching Using Time-Series Moving Average Transform

To achieve the noise reduction effect in boundary image matching, we exploit the moving average transform of time-series matching. Our motivation is that using the moving average transform we may reduce noise in boundary image matching as in time-series matching. We first propose a new notion of k-order image matching, which applies the moving average transform to boundary image matching. A boundary image can be represented as a sequence in the time-series domain, and our k-order image matching identifies similar boundary images in this time-series domain by comparing the k-moving average transformed sequences. Next, we propose an index-based method that efficiently performs k-order image matching on a large image database, and prove its correctness. Moreover, we present its index building and k-order image matching algorithms. Experimental results show that our k-order image matching exploits the noise reduction effect, and our index-based method outperforms the sequential scan by one or two orders of magnitude.
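
A minimal sketch of the k-moving-average transform referred to above, applied to two boundary sequences before a Euclidean comparison; the sequences, the order k = 3 and the function names are illustrative assumptions, and the paper's indexing and correctness machinery is not reproduced.

```python
def moving_average(seq, k):
    """k-moving-average transform of a numeric sequence, the smoothing
    step that k-order image matching applies to boundary sequences."""
    if k > len(seq):
        raise ValueError("k must not exceed the sequence length")
    return [sum(seq[i:i + k]) / k for i in range(len(seq) - k + 1)]

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Two noisy boundary sequences compared after 3-order smoothing.
s1 = [10, 12, 11, 30, 12, 11, 10]   # spike = boundary noise
s2 = [10, 11, 12, 13, 12, 11, 10]
print(euclidean(moving_average(s1, 3), moving_average(s2, 3)))
```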

Bum-Soo Kim, Yang-Sae Moon, Jinho Kim
Approximate Range-Sum Queries over Data Cubes Using Cosine Transform

In this research, we propose to use the discrete cosine transform to approximate the cumulative distributions of data cube cells’ values. The cosine transform is known to have a good energy compaction property and thus can approximate data distribution functions easily with a small number of coefficients. The derived estimator is accurate and easy to update. We perform experiments to compare its performance with a well-known technique, the (Haar) wavelet. The experimental results show that the cosine transform performs much better than the wavelet in estimation accuracy, speed, space efficiency, and ease of update.
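
A small sketch, under assumed parameters, of the idea described above: the cumulative distribution of a one-dimensional slice of cell values is compressed by keeping a few DCT coefficients and then used to answer an approximate range-sum; the helper names and the number of retained coefficients are illustrative only.

```python
import numpy as np
from scipy.fft import dct, idct

def compress_cdf(cell_values, num_coeffs):
    """Keep only the first few DCT coefficients of the cumulative
    distribution of the cube cells' values (energy compaction)."""
    cumulative = np.cumsum(cell_values, dtype=float)
    coeffs = dct(cumulative, norm="ortho")
    coeffs[num_coeffs:] = 0.0          # drop high-frequency terms
    return coeffs

def approx_range_sum(coeffs, lo, hi):
    """Estimate sum(cell_values[lo:hi+1]) from the compressed CDF."""
    approx_cum = idct(coeffs, norm="ortho")
    lower = approx_cum[lo - 1] if lo > 0 else 0.0
    return approx_cum[hi] - lower

values = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3], dtype=float)
coeffs = compress_cdf(values, num_coeffs=4)
print(approx_range_sum(coeffs, 2, 5), values[2:6].sum())  # estimate vs exact 19.0
```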

Wen-Chi Hou, Cheng Luo, Zhewei Jiang, Feng Yan, Qiang Zhu

Session 9: Temporal, Spatial, and High Dimensional Databases II

Querying Multigranular Spatio-temporal Objects

The integrated management of both spatial and temporal components of information is crucial in order to extract significant knowledge from datasets concerning phenomena of interest to a large variety of applications. Moreover, multigranularity, i.e., the capability of representing information at different levels of detail, enhances the data modelling flexibility and improves the analysis of information, enabling users to zoom in and out of spatio-temporal datasets. Relying on an existing multigranular spatio-temporal extension of the ODMG data model, in this paper we describe the design of a multigranular spatio-temporal query language. We extend OQL value comparison and object navigation in order to access spatio-temporal objects with attribute values defined at different levels of detail.

Elena Camossi, Michela Bertolotto, Elisa Bertino
Space-Partitioning-Based Bulk-Loading for the NSP-Tree in Non-ordered Discrete Data Spaces

Properly-designed bulk-loading techniques are more efficient than the conventional tuple-loading method in constructing a multidimensional index tree for a large data set. Although a number of bulk-loading algorithms have been proposed in the literature, most of them were designed for continuous data spaces (CDS) and cannot be directly applied to non-ordered discrete data spaces (NDDS). In this paper, we present a new space-partitioning-based bulk-loading algorithm for the NSP-tree — a multidimensional index tree recently developed for NDDSs. The algorithm constructs the target NSP-tree by repeatedly partitioning the underlying NDDS for a given data set until input vectors in every subspace can fit into a leaf node. Strategies to increase the efficiency of the algorithm, such as multi-way splitting, memory buffering and balanced space partitioning, are employed. Histograms that characterize the data distribution in a subspace are used to decide space partitions. Our experiments show that the proposed bulk-loading algorithm is more efficient than the tuple-loading algorithm and a popular generic bulk-loading algorithm that could be utilized to build the NSP-tree.

Gang Qian, Hyun-Jeong Seok, Qiang Zhu, Sakti Pramanik
Efficient Updates for Continuous Skyline Computations

We address the problem of maintaining continuous skyline queries efficiently over dynamic objects with d dimensions. Skyline queries are an important new search capability for multi-dimensional databases. In contrast to most of the prior work, we focus on the unresolved issue of frequent data object updates. In this paper we propose the ESC algorithm, an Efficient update approach for Skyline Computations, which creates a pre-computed second skyline set that facilitates an efficient and incremental skyline update strategy and results in a quicker response time. With the knowledge of the second skyline set, ESC is able (1) to efficiently find substitute skyline points from the second skyline set only when removing or updating a skyline point (which we call a first skyline point) and (2) to delegate the most time-consuming skyline update computation to another independent procedure, which is executed after the complete updated query result is reported. We leverage the basic idea of the traditional BBS skyline algorithm for our novel design of a two-threaded approach. The first skyline can be replenished quickly from a small set of second skyline points, hence enabling a fast query response time, while de-coupling the computationally complex maintenance of the second skyline. Furthermore, we propose the Approximate Exclusive Data Region algorithm (AEDR) to reduce the computational complexity of determining a candidate set for second skyline updates. In this paper, we evaluate the ESC algorithm through rigorous simulations and compare it with existing techniques. We present experimental results to demonstrate the performance and utility of our novel approach.
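
For readers unfamiliar with skylines, the sketch below shows only the underlying dominance test and a basic block-nested-loop skyline over a static point set (assuming smaller values are better in every dimension); it is background for the abstract above, not the ESC, BBS or AEDR algorithms.

```python
def dominates(p, q):
    """p dominates q if p is no worse in every dimension and strictly
    better in at least one (here, smaller values are better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Basic block-nested-loop skyline of a static point set; an update
    algorithm such as ESC maintains this set incrementally instead of
    recomputing it from scratch."""
    result = []
    for p in points:
        if any(dominates(q, p) for q in points if q is not p):
            continue
        result.append(p)
    return result

pts = [(1, 9), (3, 3), (5, 6), (4, 4), (9, 1)]
print(skyline(pts))  # [(1, 9), (3, 3), (9, 1)]
```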

Yu-Ling Hsueh, Roger Zimmermann, Wei-Shinn Ku

Session 10: Data and Information Semantics

Semantic Decision Tables: Self-organizing and Reorganizable Decision Tables

A Semantic Decision Table (SDT) provides a means to capture and examine decision makers’ concepts, as well as a tool for refining their decision knowledge and facilitating knowledge sharing in a scalable manner. One challenge SDT faces is to organize decision resources represented in a tabular format based on the user’s needs at different levels. It is important to make the table self-organized and automatically reorganized when the requirements are updated. This paper describes the ongoing research on SDT and its tool that supports the self-organization and automatic reorganization of decision tables. We argue that simplicity, precision, and flexibility are the key issues in responding to this challenge. We propose a novel combination of the principles of Decision Support and Database Modeling, together with modern technologies in Ontology Engineering, in the adaptive self-organization and automatic reorganization procedures (SOAR).

Yan Tang, Robert Meersman, Jan Vanthienen
Translating SQL Applications to the Semantic Web

The content of most Web pages is dynamically derived from an underlying relational database. Thus, the success of the Semantic Web hinges on enabling access to relational databases and their content by semantic methods. We define a system for automatic transformation of SQL DDL schemas into OWL DL ontologies. This system goes further than earlier efforts in that the entire system is expressed in first-order logic. We leverage the formal approach to show the system is complete with respect to a space of the possible relations that can be formed among relational tables as a consequence of primary and foreign key combinations. The full set of transformation rules is stratified, thus the system can be executed directly by a Datalog interpreter.

Syed Hamid Tirmizi, Juan Sequeda, Daniel Miranker
An Agent Framework Based on Signal Concepts for Highlighting the Image Semantic Content

This paper addresses the image semantic gap (i.e. the difficulty of automatically characterizing the image semantic content through extracted low-level signal features) by investigating the formation of semantic concepts (such as mountains, sky, grass, ...) in a population of image agents: abstract structures representing the image visual entities. Through the development of processes mapping extracted low-level features to concept-based visual information, our contribution is twofold. First, we propose a learning framework mapping signal (color, texture) and semantic concepts to highlight the image agents. Contrary to traditional architectures considering high-dimensional spaces of low-level extracted signal features, this framework addresses the curse of dimensionality. Then, at the image agent population level, the agents communicate about the perceived semantic concepts with no access to global information or to the representations of other agents; they only exchange conceptual information. While doing so they adapt their internal representations to be more successful at conveying the perceived semantic information in future interactions. The image content is therefore soundly inferred through these concept-based linguistic interactions.

The SIR Agent prototype implements our theoretical framework, and its architecture revolves around functional modules enabling the characterization of concept-based linguistic structures, highlighting the image agents and enforcing interactions and coordination between them.

Mohammed Belkhatir

Session 11: XML Databases I

Supporting Proscriptive Metadata in an XML DBMS

MetaXQuery is a language for querying data enhanced with metadata. The MetaXQuery data model (MetaDOM) attaches metadata to each element in an XML data collection, and extends XQuery with several constructs to process and query metadata. In this paper we show how to extend a native XML DBMS, namely eXist, to support MetaXQuery. The additional query functionality can be efficiently implemented by judicious reuse of eXist’s indexes and query evaluation engine.

Hao Jin, Curtis Dyreson
XPath Rewriting Using Multiple Views

We study the problem of tree pattern query rewriting using multiple views for the class of tree patterns in P{//,[]}. Previous work has considered the rewriting problem using a single view. We consider two different ways of combining multiple views, define rewritings of a tree pattern using these combinations, and study the relationship between them. We show that when rewritings using single views do not exist, we may use such combinations of multiple views to rewrite a query, and even if rewritings using single views do exist, the rewritings using combinations of multiple views may provide more answers than those provided by the union of the rewritings using the individual views. We also study properties of intersections of tree patterns, and present algorithms for finding rewritings using intersections of views.

Junhu Wang, Jeffrey Xu Yu
Superimposed Code-Based Indexing Method for Extracting MCTs from XML Documents

With the exponential increase in the amount of XML data on the Internet, information retrieval techniques on tree-structured XML documents such as keyword search become important. The search results for this retrieval technique are often represented by minimum connecting trees (MCTs) rooted at the lowest common ancestors (LCAs) of the nodes containing all the search keywords. Recently, effective methods such as the stack-based algorithm for generating the lowest grouped distance MCTs (GDMCTs), which derive a more compact representation of the query results, have been proposed. However, when the XML documents and the number of search keywords become large, these methods are still expensive. To achieve more efficient algorithms for extracting MCTs, especially lowest GDMCTs, we first consider two straightforward LCA detection methods: keyword B+-trees with Dewey-order labels and superimposed code-based indexing methods. Then, we propose a method for efficiently detecting the LCAs, which combines the two straightforward indexing methods for LCA detection. We also present an effective solution for the false drop problem caused by the superimposed code. Finally, the proposed LCA detection methods are applied to generate the lowest GDMCTs. We conduct detailed experiments to evaluate the benefits of our proposed algorithms and show that the proposed combined method can completely solve the false drop problem and outperforms the stack-based algorithm in extracting the lowest GDMCTs.
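
A tiny sketch of the basic operation the indexes above accelerate: with Dewey-order labels, the lowest common ancestor of a set of nodes is simply their longest common label prefix. The dot-separated label format and the function name are assumptions.

```python
def dewey_lca(labels):
    """Lowest common ancestor of nodes identified by Dewey-order labels
    (e.g. '1.2.3'): it is their longest common label prefix."""
    split = [label.split(".") for label in labels]
    prefix = []
    for parts in zip(*split):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return ".".join(prefix)

# Nodes containing two search keywords; their LCA roots the minimum
# connecting tree (MCT) for this keyword combination.
print(dewey_lca(["1.2.3.1", "1.2.4"]))  # '1.2'
```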

Wenxin Liang, Takeshi Miki, Haruo Yokota
Fast Matching of Twig Patterns

Twig pattern matching plays a crucial role in XML data processing. Existing twig pattern matching algorithms can be classified into two-phase algorithms and one-phase algorithms. While the two-phase algorithms (e.g., TwigStack) suffer from expensive merging cost, the one-phase algorithms (e.g., TwigList, Twig²Stack, HolisticTwigStack) either lack efficient filtering of useless elements, or use over-complicated data structures. In this paper, we present two novel one-phase holistic twig matching algorithms, TwigMix and TwigFast, which combine the efficient selection of useful elements (introduced in TwigStack) with the simple lists for storing final solutions (introduced in TwigList). TwigMix simply introduces the element selection function of TwigStack into TwigList to avoid manipulation of useless elements in the stack and lists. TwigFast further improves this by introducing some pointers in the lists to completely avoid the use of stacks. Our experiments show TwigMix significantly and consistently outperforms TwigList and HolisticTwigStack (up to several times faster), and TwigFast is up to two times faster than TwigMix.

Jiang Li, Junhu Wang

Session 12: XML Databases II

XML Filtering Using Dynamic Hierarchical Clustering of User Profiles

Information filtering systems constitute a critical component in modern information seeking applications. As the number of users grows and the information available becomes even bigger, it is crucial to employ scalable and efficient representation and filtering techniques. In this paper we propose an innovative XML filtering system that utilizes clustering of user profiles in order to reduce the filtering space and achieves sub-linear filtering time. The proposed system employs a unique sequence representation for user profiles and XML documents based on the depth-first traversal of the XML tree, and an appropriate distance metric in order to compare and cluster the user profiles and filter the incoming XML documents. Experimental results show that the proposed system outperforms previous approaches to XML filtering and achieves sub-linear filtering time.
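
A minimal sketch of the kind of sequence representation mentioned above: the depth-first traversal of an XML tree rendered as a tag sequence, using the standard library parser. This is an illustrative stand-in, not the authors' exact encoding or distance metric.

```python
import xml.etree.ElementTree as ET

def depth_first_sequence(xml_text):
    """Depth-first traversal of an XML tree, returning the sequence of
    element tags in visit order; a simple stand-in for the sequence
    representation used to compare profiles and documents."""
    root = ET.fromstring(xml_text)
    sequence = []

    def visit(node):
        sequence.append(node.tag)
        for child in node:
            visit(child)

    visit(root)
    return sequence

doc = "<book><title/><authors><author/><author/></authors></book>"
print(depth_first_sequence(doc))
# ['book', 'title', 'authors', 'author', 'author']
```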

Panagiotis Antonellis, Christos Makris
Person Retrieval on XML Documents by Coreference Analysis Utilizing Structural Features

Present-day keyword retrieval exploits the frequencies and positions of search keywords in target documents. For retrieval by two or more keywords, the semantic relation between the keywords is important. For retrieving information about a person, it is common to search with a pair of keywords consisting of the person’s name and his/her attribute of interest. By using dependency analysis and coreference analysis, correct occurrences of pairs of a person and his/her attributes can be retrieved. However, existing natural language analysis does not consider the fact that the logical structures of documents strongly influence the probabilistic patterns of coreference. In this paper, we propose a new way of person retrieval by computing a maximum entropy model from linguistic features and structural features, where the structural features are learned from the probabilistic distribution of coreference over XML document structures. Our method can utilize the strong correlation between XML document structures and coreference, thus achieving higher accuracy than existing methods.

Yumi Yonei, Mizuho Iwaihara, Masatoshi Yoshikawa
HFilter: Hybrid Finite Automaton Based Stream Filtering for Deep and Recursive XML Data

XML filtering applications are gaining increasing popularity recently. Automata are generally adopted to construct query indexes for evaluating large numbers of XPath queries over XML streams. Usually only shallow data are observed in existing approaches. How to process deep and recursive XML data with low memory limitation efficiently is still a challenging issue. In this paper, we propose HFilter, a Hybrid Finite Automaton (HFA) based stream filtering approach, to solve this problem. We introduce the basic two-tier HFA (lazy DFA tier and NFA tier) first, which realizes data prefix sharing and memory overflow control to improve the filtering throughput. Then an optimized three-tier HFA with an extra pre-expanded DFA tier is put forward, which significantly reduces the restarting cost of HFA after memory overflow. Experiments show that our approaches work more efficiently than existing ones.

Weiwei Sun, Yongrui Qin, Ping Yu, Zhuoyao Zhang, Zhenying He
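
As background for automaton-based stream filtering of the kind described above, the sketch below evaluates simple linear path queries (child and descendant axes only) over a stream of start/end-element events by keeping a stack of active NFA states. It is a minimal, NFA-only illustration and does not implement HFilter's two- or three-tier HFA; the query encoding and function names are assumptions.

# Minimal sketch of NFA-based XPath filtering over an XML event stream.
# A query is a list of steps (axis, name) with axis '/' (child) or '//'
# (descendant). State i means "the first i steps have matched".
def filter_stream(steps, events):
    stack = [{0}]          # active states before the root element
    matched = False
    for kind, tag in events:
        if kind == "start":
            frame = set()
            for s in stack[-1]:
                if s == len(steps):
                    frame.add(s)                     # already accepted
                    continue
                axis, name = steps[s]
                if axis == "//":
                    frame.add(s)                     # may still match deeper
                if name == tag or name == "*":
                    frame.add(s + 1)
            if len(steps) in frame:
                matched = True
            stack.append(frame)
        else:                                        # "end"
            stack.pop()
    return matched

query = [("//", "book"), ("/", "title")]             # roughly //book/title
events = [("start", "library"), ("start", "book"),
          ("start", "title"), ("end", "title"),
          ("end", "book"), ("end", "library")]
print(filter_stream(query, events))                  # True
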

Session 13: Query Processing and Optimization

Read-Optimized, Cache-Conscious, Page Layouts for Temporal Relational Data

The efficient management of temporal data is crucial for many traditional and emerging database applications. A major performance bottleneck for database systems is the memory hierarchy. The performance of the memory hierarchy is directly related to how the content of disk pages maps to the L2 cache lines, i.e. to the organization of data within a page, or the page layout. The prevalent page layout in database systems is the N-ary Storage Model (NSM). As demonstrated in this paper, using NSM for temporal data deteriorates memory hierarchy performance for query-intensive workloads. This paper proposes new cache-conscious, read-optimized page layouts specifically tailored for temporal data. The proposed page layouts optimize accesses to all levels of the memory hierarchy by avoiding fetching the same data several times (as opposed to NSM). Experiments show that the proposed page layouts are substantially faster than NSM.

Khaled Jouini, Geneviève Jomier, Patrick Kabore
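
The intuition behind a cache-conscious page layout can be conveyed with a back-of-the-envelope sketch: scanning one attribute of every record touches far fewer cache lines when that attribute's values are stored contiguously than when whole records are stored contiguously (NSM). The page size, record size, and cache-line size below are illustrative assumptions, not the layouts proposed in the paper.

# Back-of-the-envelope comparison of cache-line touches for scanning one
# 8-byte attribute of every record in a page, under two layouts.
# Numbers are illustrative assumptions (8 KiB page, 64-byte cache lines).
import math

PAGE_SIZE  = 8192     # bytes
CACHE_LINE = 64       # bytes
RECORD     = 64       # bytes per record under NSM (illustrative)
ATTRIBUTE  = 8        # bytes of the attribute being scanned

records = PAGE_SIZE // RECORD

# NSM: records stored one after another; reading one attribute per record
# still drags in the cache line holding the rest of the record.
nsm_lines = records * math.ceil(ATTRIBUTE / CACHE_LINE)   # ~1 line per record

# Attribute-grouped layout: the scanned attribute's values are contiguous,
# so several of them share a cache line.
grouped_lines = math.ceil(records * ATTRIBUTE / CACHE_LINE)

print(f"records per page: {records}")
print(f"cache lines touched, NSM-style scan:     {nsm_lines}")
print(f"cache lines touched, grouped-attribute:  {grouped_lines}")
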
Exploiting Interactions among Query Rewrite Rules in the Teradata DBMS

Query rewrite (QRW) optimizations apply algebraic transformations to a SQL query Q, producing a SQL query Q’. Q and Q’ are semantically equivalent (i.e. they produce the same result), but the execution of Q’ is generally faster than that of Q. Folding views/derived tables, applying transitive closure on predicates, and converting outer joins to inner joins are some examples of QRW optimizations. In this paper, we carefully analyze the interactions among a number of rewrite rules and show how this knowledge is used to devise a triggering mechanism in the new Teradata extensible QRW subsystem, thereby enabling efficient application of the rewrite rules. We also present results from experimental studies showing that, compared to a conventional recognize-act cycle strategy, exploiting these interactions yields a significant reduction in the time and space cost of query optimization while producing the same rewritten queries.

Ahmad Ghazal, Dawit Seid, Alain Crolotte, Bill McKenna
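
One of the rewrite rules mentioned above, transitive closure on predicates, is easy to illustrate: from an equi-join predicate and a constant predicate on one of its sides, a constant predicate on the other side can be derived, which may enable further pruning. The predicate encoding and helper below are an illustrative sketch, not Teradata's QRW subsystem.

# Sketch of transitive closure over simple predicates.
# Predicates are ("eq", col_a, col_b) for joins and ("const", col, value)
# for literal comparisons; the representation is an assumption for illustration.
def transitive_closure(predicates):
    preds = set(predicates)
    changed = True
    while changed:
        changed = False
        joins  = [p for p in preds if p[0] == "eq"]
        consts = [p for p in preds if p[0] == "const"]
        for _, a, b in joins:
            for _, col, val in consts:
                for src, dst in ((a, b), (b, a)):
                    derived = ("const", dst, val)
                    if col == src and derived not in preds:
                        preds.add(derived)
                        changed = True
    return preds

# WHERE t1.x = t2.y AND t1.x = 10  ==>  also t2.y = 10
print(transitive_closure({("eq", "t1.x", "t2.y"), ("const", "t1.x", 10)}))
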
Optimal Preference Elicitation for Skyline Queries over Categorical Domains

When issuing user-specific queries, users often have a vaguely defined information need. Skyline queries identify the most “interesting” objects for users’ incomplete preferences, which provides users with an intuitive query formulation mechanism. However, the applicability of this intuitive query paradigm suffers from a severe drawback: incomplete preferences on domain values can often lead to impractical skyline result sizes. This challenge is particularly critical over categorical domains. This paper addresses it by developing an iterative elicitation framework. As user preferences are collected at each iteration, the framework aims to both minimize user interaction and maximize skyline reduction. The framework makes it possible to identify a reasonably small and focused skyline set while keeping query formulation intuitive for users; all that is needed is answering a few well-chosen questions. We perform extensive experiments to validate the benefits of our strategy and show that a few questions are enough to acquire the desired manageable skyline set.

Jongwuk Lee, Gae-won You, Seung-won Hwang, Joachim Selke, Wolf-Tilo Balke
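
The notion of dominance under incomplete preferences on categorical domains can be sketched as follows: each attribute carries a (possibly partial) strict preference, and an object dominates another only if it is at least as preferred on every attribute and strictly preferred on at least one, with incomparable values blocking dominance. The data, preference encoding, and function names below are illustrative assumptions, not the elicitation framework of the paper.

# Sketch: skyline dominance under incomplete preferences on categorical domains.
# For each attribute, the user's (possibly partial) preference is a set of
# (better, worse) pairs; values not related by the preference are incomparable.
def prefers(pref, a, b):
    """True if value a is strictly preferred to value b on this attribute."""
    return (a, b) in pref

def dominates(x, y, prefs):
    """x dominates y: no attribute where y is preferred, and at least one where x is."""
    strictly_better = False
    for attr, pref in prefs.items():
        if prefers(pref, y[attr], x[attr]):
            return False
        if prefers(pref, x[attr], y[attr]):
            strictly_better = True
        elif x[attr] != y[attr]:
            return False            # incomparable values: no dominance
    return strictly_better

prefs = {
    "airline": {("StarAir", "BudgetJet")},           # StarAir preferred
    "seat":    {("aisle", "middle")},                 # aisle preferred to middle
}
flight_a = {"airline": "StarAir",   "seat": "aisle"}
flight_b = {"airline": "BudgetJet", "seat": "middle"}
flight_c = {"airline": "StarAir",   "seat": "window"}   # window incomparable

print(dominates(flight_a, flight_b, prefs))   # True
print(dominates(flight_a, flight_c, prefs))   # False (seat values incomparable)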

Session 14A: Data and Information Streams

Categorized Sliding Window in Streaming Data Management Systems

For many applications, data is collected at very high rates from various sources. Applications that produce results from this data require very efficient processing in order to reach timely decisions. An example of such a demanding application is one that takes stock acquisition decisions based on the price updates that happen constantly while the market is open for transactions. Our proposed technique is a simple yet effective way to reduce the access time to the streaming data.

In this paper we propose an efficient indexing technique that improves the access time to data elements in sliding windows of streaming database systems. This technique, called Categorized Sliding Window, is based on splitting the data into categories and using bit vectors to avoid accesses to non-relevant data.

Our experimental results show large improvements compared with simpler techniques. For the standard Linear Road benchmark we observe a performance improvement of 3.3x for a complex continuous query. Also relevant is the fact that 90% of the performance improvement is achieved with only 65% of the maximum number of categories, which represents a memory overhead of only 13.5%.

Marios Papas, Josep-L. Larriba-Pey, Pedro Trancoso
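
A rough sketch of the categorized-window idea follows: the window slots are indexed per category with bit vectors, so a query over one category visits only the relevant slots. The window size, category function, and class names are illustrative assumptions, not the data structures of the paper.

# Sketch: a bounded sliding window whose slots are indexed per category with
# bit vectors, so a query over one category skips all non-relevant slots.
class CategorizedWindow:
    def __init__(self, size, categorize):
        self.size = size
        self.categorize = categorize
        self.slots = [None] * size          # circular buffer of tuples
        self.bits = {}                      # category -> int used as a bit vector
        self.next = 0

    def insert(self, tup):
        pos = self.next % self.size
        old = self.slots[pos]
        if old is not None:                 # expire the overwritten tuple
            self.bits[self.categorize(old)] &= ~(1 << pos)
        self.slots[pos] = tup
        cat = self.categorize(tup)
        self.bits[cat] = self.bits.get(cat, 0) | (1 << pos)
        self.next += 1

    def scan(self, category):
        """Yield only the window slots whose bit is set for this category."""
        vec = self.bits.get(category, 0)
        pos = 0
        while vec:
            if vec & 1:
                yield self.slots[pos]
            vec >>= 1
            pos += 1

w = CategorizedWindow(size=8, categorize=lambda t: t["segment"])
for i in range(10):
    w.insert({"segment": "express" if i % 3 == 0 else "local", "speed": 40 + i})
print(list(w.scan("express")))
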
Time to the Rescue - Supporting Temporal Reasoning in the Rete Algorithm for Complex Event Processing

Complex event processing is an important technology with possible applications in supply chain management and business activity monitoring. Its basis is the identification of event patterns among multiple occurring events that have logical, causal or temporal relationships.

The Rete algorithm is commonly used in rule-based systems to trigger certain actions if a corresponding rule holds. The algorithm’s good performance for a high number of rules makes it ideally suited for complex event detection. However, the traditional Rete algorithm does not support aggregation of values in time-based windows although this is a common requirement in complex event processing for business applications.

We propose an extension of the Rete algorithm to support temporal reasoning, namely the detection of time-based windows by adding a time-enabled beta-node to restrict event detection to a certain time-frame.

Karen Walzer, Matthias Groch, Tino Breddin
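
The capability the paper adds to Rete, aggregation over a time-based window, can be illustrated in isolation: events older than the window are evicted before the aggregate is read. This is a standalone sketch, not the time-enabled beta node itself; the names and the chosen aggregate are assumptions.

# Sketch: aggregation over a time-based window. Events falling outside the
# window are evicted before the running aggregate is reported.
from collections import deque

class TimeWindowSum:
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()          # (timestamp, value), oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        self.events.append((timestamp, value))
        self.total += value
        self._evict(timestamp)

    def _evict(self, now):
        while self.events and self.events[0][0] <= now - self.window:
            _, old_value = self.events.popleft()
            self.total -= old_value

    def current_sum(self, now):
        self._evict(now)
        return self.total

w = TimeWindowSum(window_seconds=60)
w.add(0, 100.0)
w.add(30, 50.0)
print(w.current_sum(45))    # 150.0 (both events inside the last 60 s)
print(w.current_sum(70))    # 50.0  (the event at t=0 has expired)
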
Classifying Evolving Data Streams Using Dynamic Streaming Random Forests

We consider the problem of data-stream classification, introducing a stream-classification algorithm, Dynamic Streaming Random Forests, that is able to handle evolving data streams using an entropy-based drift-detection technique. The algorithm automatically adjusts its parameters based on the data seen so far. Experimental results show that the algorithm handles multi-class problems for which the underlying class boundaries drift, without losing accuracy.

Hanady Abdulsalam, David B. Skillicorn, Patrick Martin
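
One flavour of drift detection can be illustrated by comparing the entropy of the class-label distribution in successive blocks of the stream and flagging a large change. The block size, threshold, and signal below are illustrative assumptions and are not claimed to reproduce the paper's entropy-based technique.

# Sketch: a crude entropy-based drift signal over a stream of class labels.
# The class distribution entropy is computed per block and compared with the
# previous block; block size and threshold are illustrative assumptions.
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def drift_points(stream, block_size=100, threshold=0.3):
    """Return the indices of blocks whose entropy jumps w.r.t. the previous block."""
    drifts, prev = [], None
    for i in range(0, len(stream) - block_size + 1, block_size):
        h = entropy(stream[i:i + block_size])
        if prev is not None and abs(h - prev) > threshold:
            drifts.append(i // block_size)
        prev = h
    return drifts

# Two classes roughly balanced, then the stream "drifts" to a single class.
stream = ["a", "b"] * 150 + ["a"] * 300
print(drift_points(stream))     # a drift is flagged where the mix changes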

Session 14B: Potpourri

On a Parameterized Antidivision Operator for Database Flexible Querying

In this paper, we introduce an algebraic relational operator called antidivision and we describe a range of interpretations that can be attached to this operator in the context of databases with fuzzy relations (i.e., relations that contain weighted tuples).

Patrick Bosc, Olivier Pivert
Providing Explanations for Database Schema Validation

We propose a new method for database schema validation that provides an explanation when it determines that a certain desirable property of a database schema does not hold. Explanations are required to give the designer a hint about the changes to the schema that are needed to fix the identified problem. Our method is an extension of the CQC method, which has been shown to be successful for testing such properties.

Guillem Rull, Carles Farré, Ernest Teniente, Toni Urpí
Temporal Conformance of Federated Choreographies

Web service composition is a new way of implementing business processes. In particular, a choreography supports modeling and enactment of interorganizational business processes consisting of autonomous organizations. Temporal constraints are important quality criteria. We propose a technique for modeling temporal constraints in choreographies and orchestrations, checking whether the orchestrations satisfy the temporal constraints of a choreography, and computing internal deadlines for the activities in an interorganizational workflow.

Johann Eder, Amirreza Tahamtan
Relational Database Migration: A Perspective

This paper presents an investigation into approaches and techniques used for database conversion. Constructing object views on top of a Relational DataBase (RDB), simple database integration and database migration are among these approaches. We present a categorisation of selected works proposed in the literature and translation techniques used for the problem of database conversion, concentrating on migrating an RDB as source into object-based and XML databases as targets. Database migration from the source into each of the targets is discussed in detail including semantic enrichment, schema translation and data conversion. Based on a detailed analysis of the existing literature, we conclude that an existing RDB can be migrated into object-based/XML databases according to available database standards. We propose an integrated method for migrating an RDB into object-based/XML databases using an intermediate Canonical Data Model (CDM), which enriches the source database’s semantics and captures characteristics of the target databases. A prototype has been implemented, which successfully translates CDM into object-oriented (ODMG 3.0 ODL), object-relational (Oracle 10g) and XML schemas.

Abdelsalam Maatuk, Akhtar Ali, Nick Rossiter

Session 15A: Data Mining

DB-FSG: An SQL-Based Approach for Frequent Subgraph Mining

Mining frequent subgraphs (FSG) is one form of graph mining for which only main-memory algorithms currently exist. There are many applications in social networks, biology, computer networks, chemistry and the World Wide Web that require mining of frequent subgraphs. The focus of this paper is to apply relational database techniques to support frequent subgraph mining. Some of the computations, such as duplicate elimination, canonical labeling, and isomorphism checking, are not straightforward in SQL. The contribution of this paper is to efficiently map these complex computations to relational operators. Unlike the main-memory counterparts of FSG, our approach addresses the most general graph representation, including multiple edges between any two vertices, bi-directional edges, and cycles. Experimental evaluation of the proposed approach is also presented in the paper.

Sharma Chakravarthy, Subhesh Pradhan
Efficient Bounds in Finding Aggregate Nearest Neighbors

Developed from Nearest Neighbor (NN) queries, Aggregate Nearest Neighbor (ANN) queries return the object that minimizes an aggregate distance function with respect to a set of query points. Because of the multiple query points, ANN queries are much more complex than NN queries. To optimize query processing and improve query efficiency, many ANN query algorithms utilize pruning strategies, with or without an index structure. Obviously, the pruning effect highly depends on the tightness of the bound estimation. In this paper, we identify a property of vector spaces and develop some efficient bound estimations for the two most popular types of ANN queries. Based on these bounds, we design indexed and non-indexed ANN algorithms and conduct experimental studies. Our algorithms show good performance, especially for high-dimensional queries, on both real and synthetic datasets.

Sansarkhuu Namnandorj, Hanxiong Chen, Kazutaka Furuse, Nobuo Ohbo
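
For the sum aggregate, a simple lower bound on a candidate's aggregate distance follows from the triangle inequality via the centroid of the query points, and it can be used to skip exact evaluation of clearly inferior candidates. The sketch below illustrates this generic kind of bound-based pruning; it is not the specific bounds derived in the paper.

# Sketch: aggregate nearest neighbor (sum of distances) with a simple
# centroid-based lower bound used for pruning. Illustrative only; the paper
# derives tighter bounds and also covers other aggregate functions.
import math

def dist(p, q):
    return math.dist(p, q)

def ann_sum(points, queries):
    n = len(queries)
    centroid = tuple(sum(q[i] for q in queries) / n for i in range(len(queries[0])))
    slack = sum(dist(centroid, q) for q in queries)

    best, best_cost = None, float("inf")
    for p in points:
        # Triangle inequality: sum_i d(p, q_i) >= n * d(p, c) - sum_i d(c, q_i)
        lower = n * dist(p, centroid) - slack
        if lower >= best_cost:
            continue                      # cannot beat the current best
        cost = sum(dist(p, q) for q in queries)
        if cost < best_cost:
            best, best_cost = p, cost
    return best, best_cost

queries = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
points  = [(2.0, 1.0), (10.0, 10.0), (1.0, 1.0)]
print(ann_sum(points, queries))
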
A Grid-Based Multi-relational Approach to Process Mining

Industrial, scientific, and commercial applications use information systems to trace the execution of a business process. Relevant events are registered in massive logs and process mining techniques are used to automatically discover knowledge that reveals the execution and organization of the process instances (cases). In this paper, we investigate the use of a multi-level relational frequent pattern discovery method as a means of process mining. In order to process such massive logs we resort to a Grid-based implementation of the knowledge discovery algorithm that distributes the computation on several nodes of a Grid platform. Experiments are performed on real event logs.

Antonio Turi, Annalisa Appice, Michelangelo Ceci, Donato Malerba
Extraction of Opposite Sentiments in Classified Free Format Text Reviews

Most of the previous approaches in opinion mining focus on the classification of opinion polarities, positive or negative, expressed in customer reviews. In this paper, we present the problem of extracting contextual opposite sentiments in classified free format text reviews. We adapt the sequence data model to text mining with Part-of-Speech tags, and then we propose a belief-driven approach for extracting contextual opposite sentiments as unexpected sequences with respect to the opinion polarity of reviews. We conclude by detailing our experimental results on free format text movie review data.

Dong (Haoyuan) Li, Anne Laurent, Mathieu Roche, Pascal Poncelet

Session 15B: XML Databases

Navigational Path Expressions on XML Schemas

XML Schema is employed for describing the type and structure of the information contained in valid XML documents. As for a document, a schema can be navigated and its components can be identified through a path language. In this paper we discuss the drawbacks of using XPath for this purpose and present XSPath, a language tailored for specifying path expressions on schemas.

Federico Cavalieri, Giovanna Guerrini, Marco Mesiti
Transforming Tree Patterns with DTDs for Query Containment Test

We study the problem of testing

xpath

query containment under

dtd

s, where the

xpath

expressions are given as tree patterns involving /,//,[] and *, and the

dtd

are given as acyclic schema graphs. We focus on efficient algorithms to transform a tree pattern

P

involving * into a new one

P

′ which does not have *, using

dtd

G

, so that testing containment of

P

in any other pattern

Q

under

G

is reduced to testing whether

P

′ is contained in

Q

without

dtd

, provided

Q

does not have *.

Junhu Wang, Jeffrey Xu Yu, Chengfei Liu, Rui Zhou
XSelMark: A Micro-benchmark for Selectivity Estimation Approaches of XML Queries

Estimating the sizes of query results and intermediate results is a crucial part of any effective query optimization process. For several reasons, the selectivity estimation problem in the XML domain is more complicated than in the relational domain. Several research efforts have proposed selectivity estimation approaches in the XML domain, but the lack of a suitable benchmark has been one of the main obstacles to a real assessment of and comparison between these approaches. In this paper we propose a selectivity estimation benchmark for XML queries, XSelMark. It consists of a set of 25 queries organized into seven groups and covers the main aspects of selectivity estimation of XML queries. These queries have been designed with respect to an XML document instance of a popular benchmark for XML data management, XMark. In addition, we suggest some criteria for assessing the capability and quality of XML query selectivity estimation approaches. Finally, we use the proposed benchmark to assess the capabilities of state-of-the-art selectivity estimation approaches.

Sherif Sakr

Session 16A: Applications of Database, Information, and Decision Support Systems

A Method for Semi-automatic Standard Integration in Systems Biology

The development of standards for biological pathways has led to a huge amount of model data stored in a variety of formats represented in XML (e.g. SBML) or OWL (e.g. BioPAX). There is an urgent need to convert data between these formats, but the difficulty of the transformation hampers data integration in the area. This article proposes a general, semi-automatic solution by transforming XML Schema based data into an OWL format. Biologists are supported in querying data of any format and comparing different data files or schemas to each other, using OWL as a common format for matching. Additionally, a backwards transformation to XML Schema is provided. The paper presents a first architectural approach and its prototype implementation. The evaluation showed that the approach is promising.

Dagmar Köhn, Lena Strömbäck
FDBMS Application for a Survey on Educational Performance

Important efforts have been made to provide fuzzy querying capabilities. Nevertheless, at present very few real-life systems take advantage of this work. We present here an application for reviewing the results of a periodical student opinion survey on the educational performance of professors. This application has been developed using the SQLfi fuzzy DBMS.

Livia Borjas, Alejandro Fernández, Jorge Ortiz, Leonid Tineo
Hierarchy Encoding with Multiple Genes

Efficient implementation of type inclusion testing is important for data and knowledge base systems employing large hierarchies. The bit vector encoding of a partially ordered set representing a type hierarchy permits constant-time type inclusion testing. Current methods of this kind employ a simple encoding, associating a single gene with each join-irreducible element. We present an algorithm that uses multiple genes for those elements with many siblings. The new algorithm provides a significant improvement in the encoding size for hierarchies with low multiple inheritance factors.

Martin van Bommel, Ping Wang
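
The baseline bit-vector encoding that such work builds on can be sketched directly: each type receives one "gene" bit and its code is the bitwise OR of its own gene and its supertypes' codes, so a subtype test is a constant-time mask check. The naive one-gene-per-type assignment below is only an illustration; the paper assigns genes to join-irreducible elements and uses multiple genes for elements with many siblings to shrink the encoding.

# Sketch: naive bit-vector hierarchy encoding with constant-time subtype test.
# Each type gets one gene bit; its code ORs its gene with its parents' codes.
# The hierarchy below is an illustrative assumption.
def encode(hierarchy):
    """hierarchy: dict mapping each type to its list of direct supertypes (a DAG)."""
    codes, next_gene = {}, 0
    remaining = dict(hierarchy)
    while remaining:
        for t, parents in list(remaining.items()):
            if all(p in codes for p in parents):
                gene = 1 << next_gene
                next_gene += 1
                code = gene
                for p in parents:
                    code |= codes[p]
                codes[t] = code
                del remaining[t]
    return codes

def is_subtype(codes, a, b):
    """Constant-time test: a <= b iff b's code is contained in a's code."""
    return codes[a] & codes[b] == codes[b]

hierarchy = {
    "Object": [],
    "Number": ["Object"],
    "Integer": ["Number"],
    "String": ["Object"],
}
codes = encode(hierarchy)
print(is_subtype(codes, "Integer", "Number"))  # True
print(is_subtype(codes, "String", "Number"))   # False
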
Knowledge Mining for the Business Analyst

There is an extensive literature on data mining techniques, including several applications of these techniques in the e-commerce setting. However, all previous approaches require that expert users interpret the data mining results, making them cumbersome to use by business analysts. In this work, we describe a framework that shows how data mining technology can be effectively applied in an e-commerce environment, delivering significant benefits to the business analyst. Using a real-world case study, we demonstrate the added benefit of the proposed method. We also validate the claim that the produced results represent actionable knowledge that can help the business analyst improve the business performance, by significantly reducing the time needed for data analysis, which results in substantial financial savings.

Themis Palpanas, Jakka Sairamesh

Session 16B: Optimization and Performance

Controlling the Behaviour of Database Servers with 2PAC and DiffServ

In order to avoid stress conditions in information systems, the use of a simple admission control (SAC) mechanism is widely adopted by systems’ administrators. Most SAC approaches limit the amount of concurrent work, redirecting to a waiting FCFS queue all transactions that exceed that limit. Such a policy can be very useful when the most important metric for the system is total throughput, but a simple AC approach may not be sufficient when transactions have deadlines to meet, since in stressed scenarios a transaction may spend a lot of time just waiting for execution. This paper presents two enhancements that help keep the number of transactions executed within their deadline close to the total throughput. The enhancements are DiffServ, in which short transactions have priority, and a 2-Phase Admission Control (2PAC) mechanism, which tries to avoid the previously mentioned problem by limiting the queue size dynamically, using information provided by feedback control. The paper also introduces the QoS-Broker, a tool which implements both SAC and 2PAC, and uses it to compare their performance under the TPC-C benchmark. Our results show that both total throughput and throughput within deadline increase when the two enhancements are used, although it becomes clear that 2PAC has a much bigger impact on performance than DiffServ.

Luís Fernando Orleans, Geraldo Zimbrão, Pedro Furtado
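
The contrast between plain SAC and a DiffServ-style refinement can be sketched with a toy admission controller that caps the number of concurrently running transactions and, when a slot frees up, admits waiting short transactions before long ones. The class, queue discipline, and transaction model below are illustrative assumptions, not the QoS-Broker implementation.

# Sketch: admission control with a concurrency cap and DiffServ-style priority
# for short transactions in the waiting queue. Illustrative only.
from collections import deque

class AdmissionController:
    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self.running = set()
        self.short_queue = deque()   # short transactions are admitted first
        self.long_queue = deque()

    def submit(self, txn_id, is_short):
        if len(self.running) < self.max_concurrent:
            self.running.add(txn_id)
            return "admitted"
        (self.short_queue if is_short else self.long_queue).append(txn_id)
        return "queued"

    def finish(self, txn_id):
        self.running.discard(txn_id)
        # DiffServ-style: prefer waiting short transactions over long ones.
        if self.short_queue:
            self.running.add(self.short_queue.popleft())
        elif self.long_queue:
            self.running.add(self.long_queue.popleft())

ac = AdmissionController(max_concurrent=2)
print(ac.submit("t1", is_short=False))   # admitted
print(ac.submit("t2", is_short=False))   # admitted
print(ac.submit("t3", is_short=False))   # queued
print(ac.submit("t4", is_short=True))    # queued
ac.finish("t1")
print(ac.running)                        # t2 plus the short transaction t4
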
Compressing Very Large Database Workloads for Continuous Online Index Selection

This paper presents a novel method for compressing large database workloads for the purpose of autonomic, continuous index selection. The compressed workload contains a small subset of representative queries from the original workload. A single-pass clustering algorithm with a simple and elegant selectivity-based query distance metric guarantees low memory and time complexity. Experiments on two real-world database workloads show that the method achieves a high compression ratio without decreasing the quality of the index selection problem solutions.

Piotr Kołaczkowski
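
The single-pass clustering with a selectivity-based distance can be sketched as follows: each query is summarised by a vector of estimated selectivities, and a query joins an existing cluster if it lies within a radius of the cluster's representative, otherwise it starts a new cluster. The vector contents, radius, and distance function below are illustrative assumptions, not the paper's metric.

# Sketch: single-pass (leader) clustering of a query workload, where each query
# is summarised by a vector of estimated selectivities. Illustrative only.
import math

def distance(u, v):
    return math.dist(u, v)

def compress_workload(queries, radius=0.1):
    """queries: list of (sql_text, selectivity_vector). Returns representatives."""
    clusters = []                        # list of (representative_sql, vector, count)
    for sql, vec in queries:
        for i, (rep_sql, rep_vec, count) in enumerate(clusters):
            if distance(vec, rep_vec) <= radius:
                clusters[i] = (rep_sql, rep_vec, count + 1)
                break
        else:
            clusters.append((sql, vec, 1))
    return clusters

workload = [
    ("SELECT ... WHERE a < 10",  (0.10, 0.50)),
    ("SELECT ... WHERE a < 12",  (0.12, 0.50)),
    ("SELECT ... WHERE b = 'x'", (0.90, 0.01)),
]
for sql, _, count in compress_workload(workload):
    print(count, sql)
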
Escaping a Dominance Region at Minimum Cost

Skyline queries have gained attention as an effective way to identify desirable objects that are “not dominated” by another object in the dataset. From a market perspective, such objects can be viewed as marketable, as each of them has at least one competitive edge over all the other objects, i.e., it is not dominated. In other words, non-skyline objects are not marketable, as there always exists another product excelling in all the attributes. The goal of this paper is, for such non-skyline objects, to identify the cost-minimal enhancement that turns the object into a skyline point and thus gains marketability. More specifically, we abstract this problem as a mixed integer programming problem and develop a novel algorithm for efficiently identifying the optimal solution. Through extensive experiments using synthetic datasets, we show that our proposed framework is both efficient and scalable over a wide range of experimental settings.

Youngdae Kim, Gae-won You, Seung-won Hwang
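
For small integer-valued examples, the minimum-cost escape from a dominance region can be found by exhaustively choosing, for every dominator, one attribute on which the upgraded object will strictly exceed it. The paper formulates the problem as a mixed integer program; the brute-force sketch below is only an illustration, with larger attribute values assumed to be better.

# Sketch: brute-force version of "escape the dominance region at minimum cost"
# for small integer-valued examples where larger attribute values are better.
from itertools import product

def min_escape_cost(point, dominators):
    dims = len(point)
    best_cost, best_point = float("inf"), None
    # For every dominator, pick one attribute on which the upgraded point
    # will strictly exceed it; then take the cheapest consistent upgrade.
    for choice in product(range(dims), repeat=len(dominators)):
        target = list(point)
        for d, attr in zip(dominators, choice):
            target[attr] = max(target[attr], d[attr] + 1)
        cost = sum(t - p for t, p in zip(target, point))
        if cost < best_cost:
            best_cost, best_point = cost, tuple(target)
    return best_cost, best_point

# A product dominated by two competitors (larger is better in every attribute).
print(min_escape_cost((3, 3), [(5, 4), (4, 6)]))   # cheapest upgrade escapes both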

Session 17: Schema, Process and Knowledge Modelling and Evolution

Evolutionary Clustering in Description Logics: Controlling Concept Formation and Drift in Ontologies

We present a method based on clustering techniques to detect concept drift or novelty in a knowledge base expressed in Description Logics. The method exploits an effective and language-independent semi-distance measure defined for the space of individuals, which is based on a finite number of dimensions corresponding to a committee of discriminating features (represented by concept descriptions). In the algorithm, the possible clusterings are represented as strings of central elements (medoids, w.r.t. the given metric) of variable length. The number of clusters is not required as a parameter; the method is able to find an optimal choice by means of the evolutionary operators and a fitness function. Experiments with some ontologies prove the feasibility of our method and its effectiveness in terms of clustering validity indices. Then, with a supervised learning phase, each cluster can be assigned a refined or newly constructed intensional definition expressed in the adopted language.

Nicola Fanizzi, Claudia d’Amato, Floriana Esposito
Model–Driven, View–Based Evolution of Relational Databases

Among other issues, database evolution includes the necessity of propagating changes inside and between abstraction levels. There exist several mechanisms for carrying out propagations from one level to another, which are distinguished on the basis of when and how the changes are performed. The strict mechanism, which implies the immediate realization of modifications, is a time-consuming process. In this paper we propose a solution that is closer to the lazy and logical mechanisms, in which changes are delayed or never actually realized, respectively. This solution makes use of the notion of view. The use of views allows data to remain unchanged when changes are unnecessary and facilitates carrying out changes when they are required.

Eladio Domínguez, Jorge Lloret, Ángel L. Rubio, María A. Zapata
Inventing Less, Reusing More, and Adding Intelligence to Business Process Modeling

Recently, a variety of workflow patterns has been proposed, focusing on specific aspects like control flow, data flow, and resource assignments. Though these patterns are relevant for implementing Business Process Modeling (BPM) tools and for evaluating the expressiveness of BPM languages, they do not help to reduce redundant specifications of recurrent business functions when modeling business processes. Furthermore, contemporary BPM tools do not support process designers in defining, querying, and reusing activity patterns as building blocks for process modeling. To address these problems, this paper proposes a set of activity patterns, demonstrates their practical relevance, and introduces a BPM tool for modeling business processes based on the reuse of these activity patterns. Altogether, our approach fosters reuse of business function specifications and helps to improve the quality and comparability of business process models.

Lucinéia H. Thom, Manfred Reichert, Carolina M. Chiao, Cirano Iochpe, Guillermo N. Hess
Backmatter
Metadata
Title: Database and Expert Systems Applications
Editors: Sourav S. Bhowmick, Josef Küng, Roland Wagner
Copyright Year: 2008
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-540-85654-2
Print ISBN: 978-3-540-85653-5
DOI: https://doi.org/10.1007/978-3-540-85654-2
