
2004 | Book

Conceptual Modeling – ER 2004

23rd International Conference on Conceptual Modeling, Shanghai, China, November 8-12, 2004. Proceedings

Edited by: Paolo Atzeni, Wesley Chu, Hongjun Lu, Shuigeng Zhou, Tok-Wang Ling

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


Table of Contents

Frontmatter

Keynote Addresses

Entity Resolution: Overview and Challenges

Entity resolution is a problem that arises in many information integration scenarios: we have two or more sources containing records on the same set of real-world entities (e.g., customers). However, there are no unique identifiers that tell us which records from one source correspond to those in the other sources. Furthermore, the records representing the same entity may have differing information; e.g., one record may have the address misspelled, another record may be missing some fields. An entity resolution algorithm attempts to identify the matching records from multiple sources (i.e., those corresponding to the same real-world entity) and merges the matching records as best it can. Entity resolution algorithms typically rely on user-defined functions that (a) compare fields or records to determine whether they match (are likely to represent the same real-world entity), and (b) merge matching records into one, perhaps combining fields in the process (e.g., creating a new name based on two slightly different versions of the name). In this talk I will give an overview of the Stanford SERF Project, which is building a framework to describe and evaluate entity resolution schemes. In particular, I will give an overview of some of the different entity resolution settings:

De-duplication versus fidelity enhancement. In the de-duplication problem, we have a single set of records, and we try to merge the ones representing the same real-world entity. In the fidelity enhancement problem, we have two sets of records: a base set of records of interest, and a new set of acquired information. The goal is to coalesce the new information into the base records.

Clustering versus snapping. With snapping, we examine records pairwise and decide whether they represent the same entity. If they do, we merge the records into one, and continue the process of pairwise comparisons. With clustering, we analyze all records and partition them into groups we believe represent the same real-world entity. At the end, each partition is merged into one record.

Confidences. In some entity resolution scenarios we must manage confidences. For example, input records may have a confidence value representing how likely it is that they are true. Snap rules (which tell us when two records match) may also have confidences representing how likely it is that two records actually represent the same real-world entity. As we merge records, we must track their confidences.

Schema mismatches. In some entity resolution scenarios we must deal not just with resolving information on entities, but also with resolving discrepancies among the schemas of the different sources. For example, the attribute names and formats from one source may not match those of other sources.
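
As a rough illustration of the snapping strategy described above, the following Python sketch repeatedly compares records pairwise and merges matches until no further merges apply. The record format and the match/merge functions are hypothetical stand-ins for the user-defined functions the abstract mentions, not the SERF framework itself.

def snap(records, match, merge):
    """Pairwise entity resolution: repeatedly merge any two records
    that the user-defined match() function declares equivalent."""
    records = list(records)
    changed = True
    while changed:
        changed = False
        for i in range(len(records)):
            for j in range(i + 1, len(records)):
                if match(records[i], records[j]):
                    merged = merge(records[i], records[j])
                    # Replace the matching pair with the merged record and restart.
                    records = [r for k, r in enumerate(records) if k not in (i, j)]
                    records.append(merged)
                    changed = True
                    break
            if changed:
                break
    return records

# Hypothetical user-defined functions over simple dict records.
match = lambda a, b: a.get("email") is not None and a.get("email") == b.get("email")
merge = lambda a, b: {**a, **b}
print(snap([{"email": "x@y.com", "name": "J. Doe"},
            {"email": "x@y.com", "city": "Shanghai"}], match, merge))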

Hector Garcia-Molina
Towards a Statistically Semantic Web

The envisioned Semantic Web aims to provide richly annotated and explicitly structured Web pages in XML, RDF, or description logics, based upon underlying ontologies and thesauri. Ideally, this should enable a wealth of query processing and semantic reasoning capabilities using XQuery and logical inference engines. However, we believe that the diversity and uncertainty of terminologies and schema-like annotations will make precise querying on a Web scale extremely elusive if not hopeless, and the same argument holds for large-scale dynamic federations of Deep Web sources. Therefore, ontology-based reasoning and querying needs to be enhanced by statistical means, leading to relevance-ranked lists as query results. This paper presents steps towards such a “statistically semantic” Web and outlines technical challenges. We discuss how statistically quantified ontological relations can be exploited in XML retrieval, how statistics can help in making Web-scale search efficient, and how statistical information extracted from users’ query logs and click streams can be leveraged for better search result ranking. We believe these are decisive issues for improving the quality of next-generation search engines for intranets, digital libraries, and the Web, and they are crucial also for peer-to-peer collaborative Web search.

Gerhard Weikum, Jens Graupmann, Ralf Schenkel, Martin Theobald

Invited Talk

The Application and Prospect of Business Intelligence in Metallurgical Manufacturing Enterprises in China

This paper introduces the application of Business Intelligence (BI) technologies in metallurgical manufacturing enterprises in China. It sets forth the development procedure and successful cases of BI at Shanghai Baoshan Iron & Steel Co., Ltd. (Shanghai Baosteel for short), and puts forward a methodology adaptable to the construction of BI systems in metallurgical manufacturing enterprises in China. Finally, it looks ahead to the next generation of BI technologies at Shanghai Baosteel. It should be mentioned as well that it is the Data Strategies Dept. of Shanghai Baosight Software Co., Ltd. (Shanghai Baosight for short) and the Technology Center of Shanghai Baoshan Iron & Steel Co., Ltd. that support and carry out the research work on BI solutions at Shanghai Baosteel.

Xiao Ji, Hengjie Wang, Haidong Tang, Dabin Hu, Jiansheng Feng

Conceptual Modeling I

Conceptual Modelling – What and Why in Current Practice

Much research has been devoted over the years to investigating and advancing the techniques and tools used by analysts when they model. In contrast to what academics, software providers and their resellers promote as what should be happening, the aim of this research was to determine whether practitioners still embrace conceptual modelling seriously. In addition, what are the most popular techniques and tools used for conceptual modelling? What are the major purposes for which conceptual modelling is used? The study found that the top six most frequently used modelling techniques and methods were ER diagramming, data flow diagramming, systems flowcharting, workflow modelling, RAD, and UML. However, the primary contribution of this study is the identification of the factors that uniquely influence analysts’ decision to continue using these techniques, viz., communication (using diagrams) to/from stakeholders, (lack of) internal knowledge of techniques, user expectations management, understanding how models integrate into the business, and tool/software deficiencies.

Islay Davies, Peter Green, Michael Rosemann, Stan Gallo
Entity-Relationship Modeling Re-revisited

Since its introduction, the Entity-Relationship (ER) model has been the vehicle of choice in communicating the structure of a database schema in an implementation-independent fashion. Part of its popularity has no doubt been due to the clarity and simplicity of the associated pictorial Entity-Relationship Diagrams (“ERDs”) and to the dependable mapping it affords to a relational database schema. Although the model has been extended in different ways over the years, its basic properties have been remarkably stable. Even though the ER model has been seen as pretty well “settled,” some recent papers, notably [4] and [2] (from whose paper our title is derived), have enumerated what their authors consider serious shortcomings of the ER model. They illustrate these by some interesting examples. We believe, however, that those examples are themselves questionable. In fact, while not claiming that the ER model is perfect, we do believe that the overhauls hinted at are probably not necessary and possibly counterproductive.

Don Goelman, Il-Yeol Song
Modeling Functional Data Sources as Relations

In this paper we present a model of functional access to data that, we argue, is suitable for modeling a class of data repositories characterized by functional access, such as web sites. We discuss the problem of modeling such data sources as a set of relations, of determining whether a given query expressed on these relations can be translated into a combination of functions defined by the data sources, and of finding an optimal plan to do so. We show that, if the data source is modeled as a single relation, an optimal plan can be found in time linear in the number of functions in the source but, if the source is modeled as a number of relations that can be joined, finding the optimal plan is NP-hard.

Simone Santini, Amarnath Gupta

Conceptual Modeling II

Roles as Entity Types: A Conceptual Modelling Pattern

Roles are meant to capture dynamic and temporal aspects of real-world objects. The role concept has been used with many semantic meanings: dynamic class, aspect, perspective, interface or mode. This paper identifies common semantics of different role models found in the literature. Moreover, it presents a conceptual modelling pattern for the role concept that includes both the static and dynamic aspects of roles. A conceptual modelling pattern is aimed at representing a specific structure of knowledge that appears in different domains. In particular, we adapt the pattern to UML. The use of this pattern eases the definition of roles in conceptual schemas. In addition, we describe the design of schemas defined using our pattern in order to implement them in any object-oriented language. We also discuss the advantages of our approach over previous ones.

Jordi Cabot, Ruth Raventós
Modeling Default Induction with Conceptual Structures

Our goal is to model the way people induce knowledge from rare and sparse data. This paper describes a theoretical framework for inducing knowledge from such incomplete data described with conceptual graphs. The induction engine is based on a non-supervised algorithm named default clustering, which uses the concept of stereotype and the new notion of default subsumption, the latter being inspired by default logic theory. A validation using artificial data sets and an application to a historical case are given at the end of the paper.

Julien Velcin, Jean-Gabriel Ganascia
Reachability Problems in Entity-Relationship Schema Instances

Recent developments in reification of ER schemata include automatic generation of web-based database administration systems [1,2]. These systems enforce the schema cardinality constraints, but, beyond unsatisfiable schemata, this feature may create unreachable instances. We prove sound and complete characterisations of schemata whose instances satisfy suitable reachability properties; these theorems translate into linear algorithms that can be used to prevent the administrator from reifying schemata with unreachable instances.

Sebastiano Vigna

Conceptual Modeling III

A Reference Methodology for Conducting Ontological Analyses

The ontological analysis of conceptual modelling techniques is increasingly popular. Related research has explored not only the ontological deficiencies of classical techniques such as ER or UML, but also business process modelling techniques such as ARIS and even Web services standards such as BPEL4WS. While the selected ontologies are reasonably mature, it is the actual process of an ontological analysis that still lacks rigor. The current procedure leaves significant room for individual interpretation and is one reason for criticism of ontological analysis as a whole. This paper proposes a procedural model for ontological analysis based on the use of meta models, the involvement of more than one coder, and metrics. This model is explained with examples from various ontological analyses.

Michael Rosemann, Peter Green, Marta Indulska
Pruning Ontologies in the Development of Conceptual Schemas of Information Systems

In the past, most conceptual schemas of information systems have been developed essentially from scratch. Currently, however, several research projects are considering an emerging approach that tries to reuse as much as possible of the knowledge included in existing ontologies. Using this approach, conceptual schemas would be developed as refinements of (more general) ontologies. However, when the refined ontology is large, a new problem arises: the need to prune the concepts in that ontology that are superfluous in the final conceptual schema. This paper proposes a new method for pruning ontologies in this approach. We show the advantages of our method with respect to similar pruning methods developed in other contexts. Our method is general and can be adapted to most conceptual modeling languages; we give the complete details of its adaptation to the UML. Moreover, the method is fully automatic and has been implemented. We illustrate the method by means of its application to a case study that refines the Cyc ontology.

Jordi Conesa, Antoni Olivé
Definition of Events and Their Effects in Object-Oriented Conceptual Modeling Languages

Most current conceptual modeling languages and methods do not model events as entities. We argue that, at least in Object-Oriented (O-O) languages, modeling events as entities provides substantial benefits. We show that a method for behavioral modeling that deals with event and entity types in a uniform way may yield better behavioral schemas. The proposed method makes extensive use of language constructs such as constraints, derived types, derivation rules, type specializations and operations, which are present in all complete O-O conceptual modeling languages. The method can be adapted to most O-O languages. In this paper we explain its adaptation to the UML.

Antoni Olivé

Conceptual Modeling IV

Enterprise Modeling with Conceptual XML

An open challenge is to integrate XML and conceptual modeling in order to satisfy large-scale enterprise needs. Because enterprises typically have many data sources using different assumptions, formats, and schemas, all expressed in – or soon to be expressed in – XML, it is easy to become lost in an avalanche of XML detail. This creates an opportunity for the conceptual modeling community to provide improved abstractions to help manage this detail. We present a vision for Conceptual XML (C-XML) that builds on the established work of the conceptual modeling community over the last several decades to bring improved modeling capabilities to XML-based development. Building on a framework such as C-XML will enable better management of enterprise-scale data and more rapid development of enterprise applications.

David W. Embley, Stephen W. Liddle, Reema Al-Kamha
Graphical Reasoning for Sets of Functional Dependencies

Reasoning on constraint sets is a difficult task. Classical database design is based on a step-wise extension of the constraint set and on consideration of constraint sets generated by tools. Since the database developer must master semantics acquisition, tools and approaches are still sought that support reasoning on sets of constraints. We propose novel approaches for the presentation of sets of functional dependencies based on specific graphs. These approaches may be used to elicit full knowledge of the validity of functional dependencies in relational schemata.

János Demetrovics, András Molnár, Bernhard Thalheim
ER-Based Software Sizing for Data-Intensive Systems

Despite the existence of well-known software sizing methods such as the Function Point method, many developers still use ad-hoc methods or so-called “expert” approaches. This is mainly because the existing methods require much implementation information that is difficult to identify or estimate in the early stages of a software project. The accuracy of ad-hoc and “expert” methods is also problematic. The entity-relationship (ER) model is widely used in conceptual modeling (requirements analysis) for data-intensive systems. From our observation, the characteristics of a data-intensive system, and therefore of the source code of its software, are well characterized by the ER diagram that models its data. Based on this observation, this paper proposes a method for building a software size model from an extended ER diagram through the use of regression models. We have collected real data from industry to perform a preliminary validation of the proposed method, and the result is very encouraging. As software sizing is an important key to software cost estimation, and therefore vital to the industry for managing software projects, we hope that the research and industry communities will further validate the proposed method.
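
By way of illustration only, the kind of regression model the abstract alludes to might relate simple ER-diagram counts to source-code size. The feature set, sample data, and least-squares fit below are assumptions for the sketch, not the authors’ model.

# Minimal sketch: fit code size (e.g., KLOC) against counts taken from
# an extended ER diagram (entity types, relationship types, attributes).
# The sample data are invented for illustration.
import numpy as np

features = np.array([  # [entities, relationships, attributes] per project
    [10,  8,  60],
    [25, 20, 150],
    [40, 35, 280],
    [15, 12,  90],
], dtype=float)
kloc = np.array([12.0, 30.0, 55.0, 18.0])

X = np.hstack([np.ones((len(features), 1)), features])   # add an intercept column
coef, *_ = np.linalg.lstsq(X, kloc, rcond=None)           # ordinary least-squares fit

new_diagram = np.array([1, 20, 16, 120], dtype=float)      # intercept + counts of a new ER diagram
print("estimated size (KLOC):", float(new_diagram @ coef))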

Hee Beng Kuan Tan, Yuan Zhao

Data Warehouse

Data Mapping Diagrams for Data Warehouse Design with UML

In Data Warehouse (DW) scenarios, ETL (Extraction, Transformation, Loading) processes are responsible for the extraction of data from heterogeneous operational data sources, their transformation (conversion, cleaning, normalization, etc.) and their loading into the DW. In this paper, we present a framework for the design of the DW back-stage (and the respective ETL processes) based on the key observation that this task fundamentally involves dealing with the specificities of information at very low levels of granularity, including transformation rules at the attribute level. Specifically, we present a disciplined framework for modeling the relationships between sources and targets at different levels of granularity, ranging from coarse mappings at the database and table levels to detailed inter-attribute mappings at the attribute level. In order to accomplish this goal, we extend UML (Unified Modeling Language) to model attributes as first-class citizens. In our attempt to provide complementary views of the design artifacts at different levels of detail, our framework is based on a principled approach to the usage of UML packages, allowing one to zoom in and out of the design of a scenario.

Sergio Luján-Mora, Panos Vassiliadis, Juan Trujillo
Informational Scenarios for Data Warehouse Requirements Elicitation

We propose a requirements elicitation process for a data warehouse (DW) that identifies its information contents. These contents support the set of decisions that can be made. Thus, if the information needed to take every decision is elicited, then the total information determines the DW contents. We propose an Informational Scenario as the means to elicit information for a decision. An informational scenario is written for each decision and is a sequence of pairs of the form <Query, Response>. A query requests information necessary to take a decision, and the response is the information itself. The set of responses for all decisions identifies the DW contents. We show that informational scenarios are merely another sub-class of the class of scenarios.
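
A minimal way to picture an informational scenario as a sequence of <Query, Response> pairs is sketched below; the decision, wording, and attribute names are invented for illustration, not taken from the paper.

# One informational scenario per decision: an ordered list of
# (query, response) pairs; the union of responses over all decisions
# identifies the data warehouse contents.
scenario = {
    "decision": "Open a new retail outlet?",          # hypothetical decision
    "steps": [
        ("Sales by region for the last 3 years?", "region, year, total_sales"),
        ("Demographics of candidate regions?",     "region, population, income"),
    ],
}

dw_contents = {attr.strip()
               for _, response in scenario["steps"]
               for attr in response.split(",")}
print(dw_contents)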

Naveen Prakash, Yogesh Singh, Anjana Gosain
Extending UML for Designing Secure Data Warehouses

Data Warehouses (DW), Multidimensional (MD) Databases, and On-Line Analytical Processing applications are used as a very powerful mechanism for discovering crucial business information. Considering the extreme importance of the information managed by these kinds of applications, it is essential to specify security measures from the early stages of DW design in the MD modeling process, and to enforce them. In recent years, there have been some proposals for representing the main MD modeling properties at the conceptual level. Nevertheless, none of these proposals considers security measures as an important element in its model, so they do not allow us to specify confidentiality constraints to be enforced by the applications that will use these MD models. In this paper, we discuss the confidentiality problems regarding DWs and we present an extension of the Unified Modeling Language (UML) that allows us to specify the main security aspects in conceptual MD modeling, thereby allowing us to design secure DWs. We then show the benefit of our approach by applying this extension to a case study. Finally, we also sketch how to implement the security aspects considered in our conceptual modeling approach in a commercial DBMS.

Eduardo Fernández-Medina, Juan Trujillo, Rodolfo Villarroel, Mario Piattini

Schema Integration I

Data Integration with Preferences Among Sources

Data integration systems represent today a key technological infrastructure for managing the enormous amount of information that is increasingly distributed over many data sources, often stored in different heterogeneous formats. Several approaches providing transparent access to the data by means of suitable query answering strategies have been proposed in the literature. These approaches often assume that all the sources have the same level of reliability and that there is no need to prefer values “extracted” from a given source. This is mainly due to the difficulties of properly translating and reformulating source preferences in terms of properties expressed over the global view supplied by the data integration system. Nonetheless, preferences are very important auxiliary information that can be profitably exploited for refining the way in which integration is carried out. In this paper we tackle the above difficulties and propose a formal framework for both specifying and reasoning with preferences among the sources. The semantics of the system is restated in terms of preferred answers to user queries, and the computational complexity of identifying these answers is investigated as well.

Gianluigi Greco, Domenico Lembo
Resolving Schematic Discrepancy in the Integration of Entity-Relationship Schemas

In schema integration, schematic discrepancies occur when data in one database correspond to metadata in another. We define this kind of semantic heterogeneity in general using the paradigm of context, that is, the meta-information relating to the source, classification, properties, etc., of entities, relationships or attribute values in entity-relationship (ER) schemas. We present algorithms to resolve schematic discrepancies by transforming metadata into entities, preserving the information and constraints of the original schemas. Although focusing on the resolution of schematic discrepancies, our technique works seamlessly with existing techniques for resolving other semantic heterogeneities in schema integration.

Qi He, Tok Wang Ling
Managing Merged Data by Vague Functional Dependencies

In this paper, we propose a new similarity measure between vague sets and apply vague logic in a relational database environment with the objective of capturing the vagueness of the data. By introducing a new vague Similar Equality (SEQ) for comparing data values, we first generalize the classical Functional Dependencies (FDs) into Vague Functional Dependencies (VFDs). We then present a set of sound and complete inference rules. Finally, we study the validation process of VFDs by examining the satisfaction degree of VFDs, and the merge-union and merge-intersection on vague relations.

An Lu, Wilfred Ng

Schema Integration II

Merging of XML Documents

How to deal with the heterogeneous structures of XML documents, identify XML data instances, solve conflicts, and effectively merge XML documents to obtain complete information is a challenge. In this paper, we define a merging operation over XML documents that can merge two XML documents with different structures. It is similar to a full outer join in relational algebra. We design an algorithm for this operation. In addition, we propose a method for merging XML elements and handling typical conflicts. Finally, we present a merge template XML file that can support recursive processing and merging of XML elements.
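
To give a very rough flavour of a full-outer-join-style merge of two XML documents with different structures, the sketch below matches elements by tag name and keeps content that appears in only one document. The matching rule and conflict handling are simplifications assumed for illustration, not the paper's algorithm or merge template.

import xml.etree.ElementTree as ET

def merge_elements(a, b):
    """Merge two elements: combine attributes, keep text from either,
    and recursively merge children that share a tag (full-outer-join flavour)."""
    out = ET.Element(a.tag, {**a.attrib, **b.attrib})
    out.text = a.text or b.text
    b_children = {c.tag: c for c in b}
    for child in a:
        if child.tag in b_children:
            out.append(merge_elements(child, b_children.pop(child.tag)))
        else:
            out.append(child)                 # present only in the first document
    for leftover in b_children.values():      # present only in the second document
        out.append(leftover)
    return out

doc1 = ET.fromstring("<book><title>ER 2004</title><year>2004</year></book>")
doc2 = ET.fromstring("<book><title>ER 2004</title><publisher>Springer</publisher></book>")
print(ET.tostring(merge_elements(doc1, doc2), encoding="unicode"))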

Wanxia Wei, Mengchi Liu, Shijun Li
Schema-Based Web Wrapping

An effective solution for automating information integration is represented by wrappers, i.e. programs designed to extract relevant content from a particular information source, such as web pages. Wrappers allow such content to be delivered through a self-describing and easily processable representation model. However, most existing approaches to wrapper design focus mainly on how to generate extraction rules, while neglecting the importance of specifying and exploiting the desired schema of the extracted information. In this paper, we propose a new wrapping approach which encompasses both extraction rules and the schema of the required information in wrapper definitions. We investigate the advantages of suitably exploiting extraction schemata, and we define a clean declarative wrapper semantics by introducing (preferred) extraction models for source HTML documents with respect to a given wrapper.

Sergio Flesca, Andrea Tagarelli
Web Taxonomy Integration Using Spectral Graph Transducer

We address the problem of integrating objects from a source taxonomy into a master taxonomy. This problem is not only currently pervasive on the web, but also important to the emerging semantic web. A straightforward approach to automating this process would be to train a classifier for each category in the master taxonomy, and then classify objects from the source taxonomy into these categories. Our key insight is that the availability of the source taxonomy data could be helpful to build better classifiers in this scenario, therefore it would be beneficial to do transductive learning rather than inductive learning, i.e., learning to optimize classification performance on a particular set of test examples. In this paper, we attempt to use a powerful transductive learning algorithm, Spectral Graph Transducer (SGT), to attack this problem. Noticing that the categorizations of the master and source taxonomies often have some semantic overlap, we propose to further enhance SGT classifiers by incorporating the affinity information present in the taxonomy data. Our experiments with real-world web data show substantial improvements in the performance of taxonomy integration.

Dell Zhang, Xiaoling Wang, Yisheng Dong

Data Classification and Mining I

Contextual Probability-Based Classification

The k-Nearest-Neighbors (kNN) method for classification is simple but effective in many cases. The success of kNN in classification depends on the selection of a “good value” for k. In this paper, we propose a contextual probability-based classification algorithm (CPC) which looks at multiple sets of nearest neighbors rather than just one set of k nearest neighbors, in order to reduce the bias of k. The proposed formalism is based on probability, and the idea is to aggregate the support of multiple neighborhoods for the various classes to better reveal the true class of each new instance. To choose a series of more relevant neighborhoods for aggregation, three neighborhood selection methods are proposed and evaluated: distance-based, symmetric-based, and entropy-based. The experimental results show that CPC obtains better classification accuracy than kNN and is indeed less biased by k after saturation is reached. Moreover, the entropy-based CPC obtains the best performance among the three proposed neighborhood selection methods.
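
The aggregation idea can be pictured roughly as follows; the neighbourhood sizes, the voting scheme, and the sample data are placeholders chosen for the sketch, not the CPC formalism or its neighborhood selection methods.

from collections import Counter
import math

def contextual_classify(train, query, k_values=(3, 5, 7)):
    """Aggregate class support over several neighbourhoods instead of a
    single k, to reduce sensitivity to the choice of k (illustrative only)."""
    ranked = sorted(train, key=lambda xy: math.dist(xy[0], query))
    support = Counter()
    for k in k_values:
        for _, label in ranked[:k]:
            support[label] += 1 / k          # each neighbourhood votes, normalised by its size
    return support.most_common(1)[0][0]

train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((1.0, 1.0), "B"),
         ((0.9, 1.1), "B"), ((0.2, 0.1), "A")]
print(contextual_classify(train, (0.15, 0.15)))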

Gongde Guo, Hui Wang, David Bell, Zhining Liao
Improving the Performance of Decision Tree: A Hybrid Approach

In this paper, a hybrid learning approach named Flexible NBTree is proposed. Flexible NBTree uses the Bayes measure δ to select the proper test and applies a post-discretization strategy to construct the decision tree. The internal nodes of the final decision tree contain univariate splits as in regular decision trees, but the leaf nodes contain General Naive Bayes, a variant of the standard Naive Bayesian classifier. Empirical studies on a set of natural domains show that Flexible NBTree has clear advantages in generalization ability when compared against its counterpart, NBTree.

LiMin Wang, SenMiao Yuan, Ling Li, HaiJun Li
Understanding Relationships: Classifying Verb Phrase Semantics

Relationships are an essential part of the design of a database because they capture associations between things. Comparing and integrating relationships from heterogeneous databases is a difficult problem, partly because of the nature of the relationship verb phrases. This research proposes a multi-layered approach to classifying the semantics of relationship verb phrases to assist in the comparison of relationships. The first layer captures fundamental, primitive relationships based upon well-known work in data abstractions and conceptual modeling. The second layer captures the life cycle of natural progressions in the business world. The third layer reflects the context-dependent nature of relationships. Use of the classification scheme is illustrated by comparing relationships from various application domains with different purposes.

Veda C. Storey, Sandeep Purao

Data Classification and Mining II

Fast Mining Maximal Frequent ItemSets Based on FP-Tree

Mining maximal frequent itemsets (MFI) is a fundamental and important problem in many data mining applications. Since the MaxMiner algorithm introduced enumeration trees for MFI mining in 1998, several methods have been proposed that use depth-first search to improve performance. This paper presents FIMfi, a new depth-first algorithm based on the FP-tree and MFI-tree for mining MFI. FIMfi adopts a novel item ordering policy for efficient lookahead pruning and a simple method for fast superset checking. It uses a variety of old and new pruning techniques to prune the search space. Experimental comparison with previous work reveals that FIMfi greatly reduces the number of FP-trees created and outperforms similar algorithms by more than 40% on average.

Yuejin Yan, Zhoujun Li, Huowang Chen
Multi-phase Process Mining: Building Instance Graphs

Deploying process-driven information systems is a time-consuming and error-prone task. Process mining attempts to improve this by automatically generating a process model from event-based data. Existing techniques try to generate a complete process model from the data acquired. However, unless this model is the ultimate goal of mining, such a model is not always required. Instead, a good visualization of each individual process instance can be enough. From these individual instances, an overall model can then be generated if required. In this paper, we present an approach which constructs an instance graph for each individual process instance, based on information in the entire data set. The results are represented in terms of Event-driven Process Chains (EPCs). This representation is used to connect our process mining to a widely used commercial tool for the visualization and analysis of instance EPCs.

B. F. van Dongen, W. M. P. van der Aalst
A New XML Clustering for Structural Retrieval

XML is becoming increasingly important in data exchange and information management. The starting point for retrieving information and integrating documents efficiently is clustering the documents that have similar structure. Thus, in this paper, we propose a new XML document clustering method based on structural similarity. Our approach first extracts the representative structures of XML documents by sequential pattern mining. We then cluster XML documents of similar structure using a clustering algorithm for transactional data, treating an XML document as a transaction and the frequent structures of documents as the items of the transaction. We also apply our technique to XML retrieval. Our experiments show the efficiency and good performance of the proposed clustering method.
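
Loosely, the preprocessing step can be thought of as turning each document's element paths into a "transaction" of structural items, after which any transactional clustering algorithm can be applied. The path extraction and Jaccard comparison below are simplifications assumed for the sketch, not the sequential-pattern-mining step described in the abstract.

import xml.etree.ElementTree as ET

def structural_items(xml_text):
    """Flatten a document into the set of root-to-element tag paths,
    treating the document as a 'transaction' of structural items."""
    root = ET.fromstring(xml_text)
    items = set()

    def walk(elem, prefix):
        path = prefix + "/" + elem.tag
        items.add(path)
        for child in elem:
            walk(child, path)

    walk(root, "")
    return items

a = structural_items("<paper><title/><authors><author/></authors></paper>")
b = structural_items("<paper><title/><abstract/></paper>")
# Jaccard similarity between the two structural transactions.
print(len(a & b) / len(a | b))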

Jeong Hee Hwang, Keun Ho Ryu

Web-Based Information Systems

Link Patterns for Modeling Information Grids and P2P Networks

Collaborative work requires, more than ever, access to data located on multiple autonomous and heterogeneous data sources. The development of these novel information platforms, referred to as information or data grids, and the evolving databases based on P2P concepts, need appropriate modeling and description mechanisms. In this paper we propose the Link Pattern Catalog as a modeling guideline for recurring problems appearing during the design or description of information grids and P2P networks. For this purpose we introduce the Data Link Modeling Language, a language for describing and modeling virtually any kind of data flows in information sharing environments.

Christopher Popfinger, Cristian Pérez de Laborda, Stefan Conrad
Information Retrieval Aware Web Site Modelling and Generation

Design and maintenance of large corporate Web sites have become a challenging problem due to the continuing increase in their size and complexity. One particular feature present in the majority of such Web sites is searching for information. However, the solutions provided so far, which are based on the same techniques used for search on the open Web, have not provided satisfactory performance for specific Web sites, often resulting in too much irrelevant content in a query answer. This paper proposes an approach to Web site modelling and the generation of intrasite search engines, combining application modelling and information retrieval techniques. Our assumption is that giving search engines access to the information provided by conceptual representations of the Web site improves their performance and accuracy. We demonstrate our proposal by describing a Web site modelling language that represents both traditional modelling features and information retrieval aspects, as well as by presenting experiments to evaluate the resulting intrasite search engine generated by our method.

Keyla Ahnizeret, David Fernandes, João M. B. Cavalcanti, Edleno Silva de Moura, Altigran S. da Silva
Expressive Profile Specification and Its Semantics for a Web Monitoring System

The World Wide Web has gained a lot of prominence with respect to information retrieval and data delivery. With such prolific growth, a user interested in a specific change has to continuously retrieve/pull information from the web and analyze it. This wastes resources and, more importantly, places the burden on the user. Pull-based retrieval needs to be replaced with a push-based paradigm for efficiency and for notification of relevant information in a timely manner. WebVigiL is an efficient profile-based system to monitor, retrieve, detect and notify specific changes to HTML and XML pages on the web. In this paper, we describe the expressive profile specification language along with its semantics. We also present an efficient implementation of these profiles. Finally, we present the overall architecture of the WebVigiL system and its implementation status.

Ajay Eppili, Jyoti Jacob, Alpa Sachde, Sharma Chakravarthy

Query Processing I

On Modelling Cooperative Retrieval Using an Ontology-Based Query Refinement Process

In this paper we present an approach for the interactive refinement of ontology-based queries. The approach is based on generating a lattice of refinements that enables a step-by-step tailoring of a query to the current information needs of a user. These needs are implicitly elicited by analysing the user’s behaviour during the searching process. The gap between a user’s need and his query is quantified by measuring several types of query ambiguities, which are used for ranking the refinements. The main advantage of the approach is more cooperative support in the refinement process: by exploiting the ontology background, the approach supports finding “similar” results and enables efficient relaxing of failing queries.

Nenad Stojanovic, Ljiljana Stojanovic
Load-Balancing Remote Spatial Join Queries in a Spatial GRID

The explosive growth of spatial data worldwide coupled with the emergence of GRID computing provides a strong motivation for designing a spatial GRID which allows transparent access to geographically distributed data. While different types of queries may be issued from any node in such a spatial GRID for retrieving the data stored at other (remote) nodes in the GRID, this paper specifically addresses spatial join queries. Incidentally, skewed user access patterns may cause a disproportionately large number of spatial join queries to be directed to a few ‘hot’ nodes, thereby resulting in severe load imbalance and consequently increased user response times. This paper focusses on load-balanced spatial join processing in a spatial GRID.

Anirban Mondal, Masaru Kitsuregawa
Expressing and Optimizing Similarity-Based Queries in SQL
(Extended Abstract)

Searching for similar objects (in terms of near and nearest neighbors) of a given query object in a large set is an essential task in many applications. Recent years have seen great progress towards efficient algorithms for this task. This paper takes a query language perspective, equipping SQL with near and nearest search capability by adding a user-defined predicate, called NN-UDP. The predicate indicates, among a set of objects, whether an object is a near or nearest neighbor of a given query object. The use of the NN-UDP makes queries involving similarity searches intuitive to express. Unfortunately, traditional cost-based optimization methods that deal with traditional UDPs do not work well for such SQL queries. Better execution plans are possible with the introduction of a new operator, called NN-OP, which finds the near or nearest neighbors of a given query object from a set of objects. An optimization algorithm proposed in this paper can produce plans that take advantage of the efficient search algorithms developed in recent years. To assess the proposed optimization algorithm, this paper focuses on applications that deal with streaming time series. Experimental results show that the optimization strategy is effective.
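
Conceptually, a near/nearest-neighbor user-defined predicate behaves like the following Python function. The brute-force scan and the hypothetical SQL in the comment are illustrative stand-ins for the paper's NN-UDP and NN-OP; a real plan would use an efficient similarity-search algorithm instead.

import math

def nn_udp(candidate, query, objects, k=1):
    """True iff `candidate` is among the k nearest neighbours of `query`
    within `objects` (brute-force illustration of the predicate's meaning)."""
    nearest = sorted(objects, key=lambda o: math.dist(o, query))[:k]
    return candidate in nearest

objects = [(0.0, 0.0), (1.0, 1.0), (0.2, 0.1), (5.0, 5.0)]
query = (0.0, 0.1)
# Rows for which the predicate holds, i.e. roughly the result of
# SELECT * FROM objects WHERE NN_UDP(obj, :query, 2)   -- hypothetical SQL
print([o for o in objects if nn_udp(o, query, objects, k=2)])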

Like Gao, Min Wang, X. Sean Wang, Sriram Padmanabhan

Query Processing II

XSLTGen: A System for Automatically Generating XML Transformations via Semantic Mappings

XML is rapidly emerging as a dominant standard for representing and exchanging information. The ability to transform and present data in XML is crucial and XSLT is a relatively recent programming language, specially designed to support this activity. Despite its utility, however, XSLT is widely considered a difficult language to learn. In this paper, we present XSLTGen: An Automatic XSLT Generator, a novel system that automatically generates an XSLT stylesheet, given a source XML document and a desired output HTML or XML document. It allows users to become familiar with and learn XSLT, based solely on their knowledge of XML or HTML. Our method is based on the use of semantic mappings between the input and output documents. We show how such mappings can be first discovered and then employed to create XSLT stylesheets. The results of our experiments show that XSLTGen works well with different varieties of XML and HTML documents.

Stella Waworuntu, James Bailey
Efficient Recursive XML Query Processing in Relational Database Systems

There is growing evidence that schema-conscious approaches are a better option than schema-oblivious techniques as far as XML query performance in a relational environment is concerned. However, the issue of recursive XML queries has not been dealt with satisfactorily for such approaches. In this paper we argue that it is possible to design a schema-oblivious approach that outperforms schema-conscious approaches for certain types of recursive queries. To that end, we propose a novel schema-oblivious approach called Sucxent++ that outperforms existing schema-oblivious approaches such as XParent by up to 15 times and schema-conscious approaches (Shared-Inlining) by up to 3 times for recursive query execution. Our approach has up to 2 times smaller storage requirements than existing schema-oblivious approaches and 10% smaller than schema-conscious techniques. In addition, existing schema-oblivious approaches are hampered by poor query plans generated by the relational query optimizer. We propose optimizations in the XML-query-to-SQL translation process that generate queries with more optimal query plans.

Sandeep Prakash, Sourav S. Bhowmick, Sanjay Madria
Situated Preferences and Preference Repositories for Personalized Database Applications

Advanced personalized web applications require careful handling of their users’ wishes and preferences. Since such preferences do not always hold in general, personalized applications also have to consider the user’s current situation. In this paper we present a novel framework for modeling situations and situated preferences. Our approach consists of a general meta model for situations, which can serve as the foundation for situation models in a wide range of applications. Furthermore, an XML-based preference repository for the storage and management of situated preferences is developed. Long-term and situated preferences can easily be accessed through the preference repository interface. In particular, the preferences best matching a given situation can be queried. This approach allows web applications to react flexibly and in a personalized way to the changing situations of their users.

Stefan Holland, Werner Kießling

Web Services I

Analysis and Management of Web Service Protocols

In the area of Web services and service-oriented architectures, business protocols are rapidly gaining importance and mindshare as a necessary part of Web service descriptions. Their immediate benefit is that they provide developers with information on how to write clients that can correctly interact with a given service or with a set of services. In addition, once protocols become an accepted practice and service descriptions become endowed with protocol information, the middleware can be significantly extended to better support service development, binding, and execution in a number of ways, considerably simplifying the whole service life-cycle. This paper discusses the different ways in which the middleware can leverage protocol descriptions, and focuses in particular on the notions of protocol compatibility, equivalence, and replaceability. These characterise whether two services can interact based on their protocol definitions, whether a service can replace another in general or when interacting with specific clients, and what the set of possible interactions between two services is.

Boualem Benatallah, Fabio Casati, Farouk Toumani
Semantic Interpretation and Matching of Web Services

A major issue in the study of semantic Web services concerns the matching problem of Web services. Various techniques for this problem have been proposed. Typical ones include FSM modeling, DAML-S ontology matching, description logics reasoning, and WSDL dual operation composition. They often assume the availability of concept semantic relations, based on which the capability satisfiability is evaluated. However, we find that the use of semantic relations alone in the satisfiability evaluation may lead to inappropriate results. In this paper, we study the problem and classify the existing techniques of satisfiability evaluation into three approaches, namely, set inclusion checking, concept coverage comparison and concept subsumption reasoning. Two different semantic interpretations, namely, capacity interpretation and restriction interpretation, are identified. However, each of the three approaches assumes only one interpretation and its evaluation is inapplicable to the other interpretation. To address this limitation, a novel interpretation model, called CRI model, is formulated. This model supports both semantic interpretations, and allows the satisfiability evaluation to be uniformly conducted. Finally, we present an algorithm for the unified satisfiability evaluation.

Chang Xu, Shing-Chi Cheung, Xiangye Xiao
Intentional Modeling to Support Identity Management

Identity management has arisen as a major and urgent challenge for internet-based communications and information services. Internet services involve complex networks of relationships among users and providers – human and automated – acting in many different capacities under interconnected and dynamic contexts. There is a pressing need for frameworks and models to support the analysis and design of complex social relationships and identities in order to ensure the effective use of existing protection technologies and control mechanisms. Systematic methods are needed to guide the design, operation, administration, and maintenance of internet services, in order to address complex issues of security, privacy, trust and risk, as well as interactions in functionality. All of these rely on sophisticated concepts for identity and techniques for identity management. We propose using a requirements modeling framework, GRL, to facilitate identity management for Internet services. Using this modeling approach, we are able to represent different types of identities, social dependencies between identity users and owners, service users and providers, and third party mediators. We may also analyze the strategic rationales of business players/stakeholders in the context of identity management. This modeling approach will help identity management technology vendors to provide customizable solutions, user organizations to form integrated identity management solutions, system operators and administrators to accommodate changes, and policy auditors to enforce information protection principles, e.g., Fair Information Practice Principles.

Lin Liu, Eric Yu

Web Services II

WUML: A Web Usage Manipulation Language for Querying Web Log Data

In this paper, we develop a novel Web Usage Manipulation Language (WUML) which is a declarative language for manipulating Web log data. We assume that a set of trails formed by users during the navigation process can be identified from Web log files. The trails are dually modelled as a transition graph and a navigation matrix with respect to the underlying Web topology. A WUML expression is executed by transforming it into Navigation Log Algebra (NLA), which consists of the sum, union, difference, intersection, projection, selection, power and grouping operators. As real navigation matrices are sparse, we perform a range of experiments to study the impact of using different matrix storage schemes on the performance of the NLA.
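
To make the dual representation concrete, a set of user trails over pages can be turned into a navigation (transition) matrix roughly as follows. The sparse dictionary encoding and the simple sum operator below are assumptions for the sketch, one plausible reading of the matrix view, rather than the WUML/NLA definitions themselves.

from collections import defaultdict

def navigation_matrix(trails):
    """Count page-to-page transitions across a set of user trails and
    store them sparsely as {(src, dst): count} (one possible storage scheme)."""
    counts = defaultdict(int)
    for trail in trails:
        for src, dst in zip(trail, trail[1:]):
            counts[(src, dst)] += 1
    return dict(counts)

def matrix_sum(m1, m2):
    """Combine the navigation matrices of two log periods by adding counts,
    in the spirit of a 'sum' operator over navigation matrices."""
    out = dict(m1)
    for key, v in m2.items():
        out[key] = out.get(key, 0) + v
    return out

trails = [["home", "products", "cart"],
          ["home", "about"],
          ["home", "products", "about"]]
m = navigation_matrix(trails)
print(m[("home", "products")])   # 2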

Qingzhao Tan, Yiping Ke, Wilfred Ng
An Agent-Based Approach for Interleaved Composition and Execution of Web Services

The emerging paradigm of web services promises to bring to distributed computing the same flexibility that the web has brought to the publication and search of information contained in documents. This new paradigm puts severe demands on the composition and execution of workflows, which must survive and respond to changes in the computing and business environments. Workflows facilitated by web services must, therefore, allow dynamic composition in ways that cannot be predicted in advance. Utilizing the notions of shared mental models and proactive information exchange from agent teamwork research, we propose a solution that interleaves planning and execution in a distributed manner. This paper proposes a generic model, gives the mappings of terminology between Web services and team-based agents, describes a comprehensive architecture for realizing the approach, and demonstrates its usefulness with the help of an example. A key benefit of the approach is the proactive handling of failures that may be encountered during the execution of complex web services.

Xiaocong Fan, Karthikeyan Umapathy, John Yen, Sandeep Purao
A Probabilistic QoS Model and Computation Framework for Web Services-Based Workflows

Web services promise to become a key enabling technology for B2B e-commerce. Several languages have been proposed to compose Web services into workflows. The QoS of Web services-based workflows may play an essential role in choosing constituent Web services and determining service level agreements with their users. In this paper, we identify a set of QoS metrics in the context of Web services and propose a unified probabilistic model for describing the QoS values of (atomic/composite) Web services. In our model, each QoS measure of a Web service is regarded as a discrete random variable with a probability mass function (PMF). We describe a computation framework to derive the QoS values of a Web services-based workflow. Two algorithms are proposed to reduce the sample space size when combining PMFs. The experimental results show that our computation framework is efficient and results in PMFs that are very close to the real model.
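
For two services executed in sequence, the PMF of an additive QoS measure (e.g. response time) of the composition is the discrete convolution of the individual PMFs. The sketch below illustrates only this combination step, with invented PMFs; the paper's sample-space-reduction algorithms are not reproduced.

from collections import defaultdict
from itertools import product

def combine_sequential(pmf_a, pmf_b):
    """PMF of the summed QoS value (e.g. response time) of two services
    invoked one after the other: discrete convolution of the two PMFs."""
    out = defaultdict(float)
    for (va, pa), (vb, pb) in product(pmf_a.items(), pmf_b.items()):
        out[va + vb] += pa * pb
    return dict(out)

service_a = {100: 0.7, 200: 0.3}      # QoS value -> probability (illustrative)
service_b = {50: 0.5, 150: 0.5}
print(combine_sequential(service_a, service_b))
# {150: 0.35, 250: 0.5, 350: 0.15}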

San-Yih Hwang, Haojun Wang, Jaideep Srivastava, Raymond A. Paul

Schema Evolution

Lossless Conditional Schema Evolution

Conditional schema changes change the schema of the tuples that satisfy the change condition. When the schema of a relation changes, some tuples may no longer fit the current schema. Handling the mismatch between the intended schema of tuples and the recorded schema of tuples is at the core of a DBMS that supports schema evolution. We propose to keep track of schema mismatches at the level of individual tuples, and prove that evolving schemas with conditional schema changes, in contrast to database systems relying on data migration, are lossless as the schema evolves. The lossless property is a precondition for a flexible semantics that allows general queries over evolving schemas to be answered correctly. The key challenge is to handle attribute mismatches between the intended and recorded schemas in a consistent way. We provide a parametric approach to resolve mismatches according to the needs of the application. We introduce the mismatch extended completed schema (MECS), which records attributes along with their mismatches, and we prove that relations with MECS are lossless.

Ole G. Jensen, Michael H. Böhlen
Ontology-Guided Change Detection to the Semantic Web Data

The Semantic Web is envisioned as the next generation web in which data instances are enriched with metadata defined in ontologies to describe the meaning of its instances. In this paper, we present an approach that exploits ontologies in guiding the change detection to their data instances. Inference rules are identified based on the semantic relationships among concepts, properties and instances as well as their change behaviors. Starting with changes to some seed instances, a reasoning engine is designed to fire the pre-defined rule set and act on ontologies to project some semantically associated concepts as target concepts. Certain instances of these target concepts are further selected as target instances, which have a high likelihood of having changed. Our approach is specifically oriented toward the Semantic Web, thus it has intelligence to exploit the semantic associations among data instances and make smart decisions.

Li Qin, Vijayalakshmi Atluri
Schema Evolution in Data Warehousing Environments – A Schema Transformation-Based Approach

In heterogeneous data warehousing environments, autonomous data sources are integrated into a materialised integrated database. The schemas of the data sources and the integrated database may be expressed in different modelling languages. It is possible for either the data source schemas or the warehouse schema to evolve. This evolution may include evolution of the schema, or evolution of the modelling language in which the schema is expressed, or both. In such scenarios, it is important for the integration framework to be evolvable, so that the previous integration effort can be reused as much as possible. This paper describes how the AutoMed heterogeneous data integration toolkit can be used to handle the problem of schema evolution in heterogeneous data warehousing environments. This problem has been addressed before for specific data models, but AutoMed has the ability to cater for multiple data models, and for changes to the data model.

Hao Fan, Alexandra Poulovassilis

Conceptual Modeling Applications I

Metaprogramming for Relational Databases

For systems that share enough structural and functional commonalities, reuse in schema development and data manipulation can be achieved by defining problem-oriented languages. Such languages are often called domain-specific, because they introduce powerful abstractions meaningful only within the domain of observed systems. In order to use domain-specific languages for database applications, a mapping to SQL is required. In this paper, we deal with metaprogramming concepts required for easy definition of such mappings. Using an example domain-specific language, we provide an evaluation of mapping performance.

Jernej Kovse, Christian Weber, Theo Härder
Incremental Navigation: Providing Simple and Generic Access to Heterogeneous Structures

We present an approach to support incremental navigation of structured information, where the structure is introduced by the data model and schema (if present) of a data source. Simple browsing through data values and their connections is an effective way for a user or an automated system to access and explore information. We use our previously defined Uni-Level Description (ULD) to represent an information source explicitly by capturing the source’s data model, schema (if present), and data values. We define generic operators for incremental navigation that use the ULD directly along with techniques for specifying how a given representation scheme can be navigated. Because our navigation is based on the ULD, the operations can easily move from data to schema to data model and back, supporting a wide range of applications for exploring and integrating data. Further, because the ULD can express a broad range of data models, our navigation operators are applicable, without modification, across the corresponding model or schema. In general, we believe that information sources may usefully support various styles of navigation, depending on the type of user and the user’s desired task.

Shawn Bowers, Lois Delcambre
Agent Patterns for Ambient Intelligence

The realization of complex distributed applications, required in areas such as e-Business, e-Government, and ambient intelligence, calls for new development paradigms, such as the Service Oriented Computing approach, which accommodates dynamic and adaptive interaction schemata carried out on a peer-to-peer level. Multi Agent Systems offer natural architectural solutions to several requirements imposed by such an adaptive approach. This work discusses the limitations of common agent patterns, typically adopted in distributed information systems design, when applied to service oriented computing, and introduces two novel agent patterns, which we call the Service Oriented Organization and the Implicit Organization Broker agent pattern, respectively. Some design aspects of the Implicit Organization Broker agent pattern are also presented. The limitations and the proposed solutions are demonstrated in the development of a multi agent system which implements a pervasive museum visitors’ guide. Some of its architecture and design features serve as a reference scenario for demonstrating both the limitations of current methods and the contribution of the newly proposed agent patterns and associated communication framework.

Paolo Bresciani, Loris Penserini, Paolo Busetta, Tsvi Kuflik

Conceptual Modeling Applications II

Modeling the Semantics of 3D Protein Structures

The post Human Genome Project era calls for reliable, integrated, flexible, and convenient data management techniques to facilitate research activities. Querying biological data that is large in volume and complex in structure such as 3D proteins requires expressive models to explicitly support and capture the semantics of the complex data. Protein 3D structure search and comparison not only enable us to predict unknown structures, but can also reveal distant evolutionary relationships that are otherwise undetectable, and perhaps suggest unsuspected functional properties. In this work, we model 3D protein structures by adding spatial semantics and constructs to represent the contributing forces such as hydrogen bonds and high-level structures such as protein secondary structures. This paper makes a contribution to modeling the specialty of life science data and develops methods to meet the novel challenges posed by such data.

Sudha Ram, Wei Wei
Risk-Driven Conceptual Modeling of Outsourcing Decisions

In the current networked world, outsourcing of information technology or even of entire business processes is often a prominent design alternative. In the general case, outsourcing is the distribution of economically viable activities over a collection of networked organizations. To evaluate outsourcing decision alternatives, we need to make a conceptual model of each of them. However, in an outsourcing situation, many actors are involved that are reluctant to spend too many resources on exploring alternatives that are not known to be cost-effective. Moreover, the particular risks involved in a specific outsourcing decision have to be identified as early as possible to focus the decision-making process. In this paper, we present a risk-driven approach to conceptual modeling of outsourcing decision alternatives, in which we model just enough of each alternative to be able to make the decision. We illustrate our approach with an example.

Pascal van Eck, Roel Wieringa, Jaap Gordijn
A Pattern and Dependency Based Approach to the Design of Process Models

In this paper an approach for building process models for e-commerce is proposed. It is based on the assumption that the process modeling task can be methodologically supported by a designer's assistant. Such a foundation provides justifications, expressible in business terms, for design decisions made in process modeling, thereby facilitating communication between systems designers and business users. Two techniques are utilized in the designer's assistant, namely process patterns and action dependencies. A process pattern is a generic template for a set of interrelated activities between two agents, while an action dependency expresses a sequential relationship between two activities.
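
As a rough, hypothetical illustration of the two techniques (not the paper's notation), the sketch below represents a process pattern as a set of activities between two agents and its action dependencies as ordering constraints, from which one admissible activity ordering is derived.

```python
# Hypothetical sketch: a process pattern as a reusable template of activities
# between two agents, and action dependencies as ordering constraints.
from graphlib import TopologicalSorter

def order_pattern(activities, dependencies):
    """Return one activity ordering consistent with the action dependencies."""
    ts = TopologicalSorter()
    for act in activities:
        ts.add(act)
    for before, after in dependencies:    # 'before' must precede 'after'
        ts.add(after, before)
    return list(ts.static_order())

# A toy "order fulfilment" pattern between a buyer and a seller.
activities = ["request quote", "send quote", "place order", "deliver", "pay"]
dependencies = [
    ("request quote", "send quote"),
    ("send quote", "place order"),
    ("place order", "deliver"),
    ("deliver", "pay"),
]
print(order_pattern(activities, dependencies))
```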

Maria Bergholtz, Prasad Jayaweera, Paul Johannesson, Petia Wohed

UML

Use of Tabular Analysis Method to Construct UML Sequence Diagrams

A sequence diagram in UML is used to model interactions among objects that participate in a use case. Developing a sequence diagram is complex; our experience shows that novice developers have significant difficulty. In earlier work, we presented a ten-step heuristic method for developing sequence diagrams. This paper presents a tabular analysis method (TAM) which improves on the ten-step heuristic method. TAM analyzes the message requirements of the use case, while documenting the resulting analysis in a tabular format. The resulting table is referenced to build the sequence diagram. This process aids novice modelers by separating the problem analysis from the learning curve of a modeling tool. Building sequence diagrams with the systematic approach of TAM facilitates consistency with the use case model and the class model. We found that developers effectively developed sequence diagrams using TAM.
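
A hypothetical example of the idea, not the paper's actual column layout: each row of the analysis table records who sends which message to whom, and the table can then be transcribed mechanically into a sequence diagram (here rendered as PlantUML text).

```python
# Hypothetical sketch of the kind of message table TAM produces; the real
# column set in the paper may differ.
rows = [
    # (step, from_object, message, to_object, return_value)
    (1, "Customer",        "placeOrder(items)",  "OrderController", "orderId"),
    (2, "OrderController", "checkStock(items)",  "Inventory",       "available"),
    (3, "OrderController", "createOrder(items)", "Order",           "order"),
    (4, "OrderController", "confirm(orderId)",   "Customer",        None),
]

def to_plantuml(rows):
    """Turn the analysis table into a simple PlantUML sequence diagram."""
    lines = ["@startuml"]
    for _, src, msg, dst, ret in rows:
        lines.append(f"{src} -> {dst}: {msg}")
        if ret:
            lines.append(f"{dst} --> {src}: {ret}")
    lines.append("@enduml")
    return "\n".join(lines)

print(to_plantuml(rows))
```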

Margaret Hilsbos, Il-Yeol Song
An Approach to Formalizing the Semantics of UML Statecharts

UML is a language for specifying, visualizing and documenting object-oriented systems. However, UML statecharts lack precisely defined syntax and semantics. This paper provides a method of formalizing the semantics of UML statecharts with Z. According to this precise semantics, UML statecharts are transformed into FREE (Flattened Regular Expression) state models. The hierarchical and concurrent structure of states is flattened in the resulting FREE state model. The model helps to determine whether a software design is consistent, unambiguous and complete. It is also beneficial to software testing.
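
The Z formalization is not reproduced here; as a minimal, hypothetical sketch of the flattening step alone, the following Python function expands composite states into their leaf states (concurrency and transitions are omitted).

```python
# Hypothetical sketch of flattening a hierarchical statechart into flat states,
# in the spirit of a FREE state model; the paper's Z semantics and handling of
# concurrency and transitions are not reproduced.

hierarchy = {                  # composite state -> its substates
    "Active": ["Idle", "Running"],
    "Running": ["Paused", "Executing"],
}

def flatten(state, hierarchy):
    """Return the leaf states reached by recursively expanding composites."""
    children = hierarchy.get(state)
    if not children:
        return [state]
    leaves = []
    for child in children:
        leaves.extend(flatten(child, hierarchy))
    return leaves

print(flatten("Active", hierarchy))   # ['Idle', 'Paused', 'Executing']
```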

Xuede Zhan, Huaikou Miao
Applying the Application-Based Domain Modeling Approach to UML Structural Views

Being part of domain engineering, domain analysis enables identifying domains and capturing their ontologies in order to assist and guide system developers in designing domain-specific applications. Several studies suggest using metamodeling techniques for modeling domains and their constraints. However, these techniques use different notions, and sometimes even different notations, for defining domains and their constraints and for specifying and designing the domain-specific applications. We propose an Application-based DOmain Modeling (ADOM) approach in which domains are treated as regular applications that need to be modeled before systems of those domains are specified and designed. This way, the domain models enforce static and dynamic constraints on their application models. The ADOM approach consists of three layers and defines dependency and enforcement relations between these layers. In this paper we describe the ADOM architecture and validation rules, focusing on applying them to UML static views, i.e., class, component, and deployment diagrams.

Arnon Sturm, Iris Reinhartz-Berger

XML Modeling

A Model Driven Approach for XML Database Development

In this paper we propose a methodological approach for the development of XML databases. Our proposal is framed in MIDAS, a model-driven methodology for the development of Web Information Systems (WISs) based on the Model Driven Architecture (MDA) proposed by the Object Management Group (OMG). In this framework, the proposed data Platform Independent Model (PIM) is the conceptual data model and the data Platform Specific Model (PSM) is the XML Schema model. Both of them are represented in UML, so we also summarize in this work an extension to UML for XML Schema. Moreover, we define the mappings to transform the data PIM into the data PSM, which will be the XML database schema. The development process of the XML database is shown by means of a case study: a WIS for the management of medical images stored in Oracle XML DB.
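
As a hypothetical illustration of one simple PIM-to-PSM mapping rule (the MIDAS mappings are richer and defined at the model level), the sketch below turns a conceptual class with typed attributes into an XML Schema element declaration.

```python
# Hypothetical sketch of a single mapping rule: a conceptual class with
# attributes becomes an XML Schema complex type. Names are illustrative only.
def class_to_xsd(name, attributes):
    parts = [f'<xs:element name="{name}">',
             "  <xs:complexType>",
             "    <xs:sequence>"]
    for attr, xsd_type in attributes:
        parts.append(f'      <xs:element name="{attr}" type="xs:{xsd_type}"/>')
    parts += ["    </xs:sequence>",
              "  </xs:complexType>",
              "</xs:element>"]
    return "\n".join(parts)

print(class_to_xsd("MedicalImage", [("patientId", "string"),
                                    ("takenOn", "date"),
                                    ("modality", "string")]))
```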

Belén Vela, César J. Acuña, Esperanza Marcos
On the Updatability of XML Views Published over Relational Data

Updates over virtual XML views that wrap relational data are not well supported by current XML data management systems. This paper studies the problem of the existence of a correct relational update translation for a given view update. First, we propose a clean extended-source theory to decide whether a translation mapping is correct. Then, to answer the question of the existence of a correct mapping, we classify a view update as un-translatable, conditionally translatable, or unconditionally translatable under a given update translation policy. We design a graph-based algorithm to classify a given update into one of the three categories based on schema knowledge extracted from the XML view and the relational base. This represents a practical approach that any existing view update system, in industry or in academia, could apply to analyze the translatability of a given update statement before its translation is attempted.

Ling Wang, Elke A. Rundensteiner
XBiT: An XML-Based Bitemporal Data Model

Past research work on modeling and managing temporal information has, so far, failed to elicit support in commercial database systems. The increasing popularity of XML offers a unique opportunity to change this situation, inasmuch as XML and XQuery support temporal information much better than relational tables and SQL. This is the important conclusion claimed in this paper, where we show that valid-time, transaction-time, and bitemporal databases can be naturally viewed in XML using temporally grouped data models. Then, we show that complex historical queries, which would be very difficult to express in SQL on relational tables, can now be easily expressed in standard XQuery on such XML-based representations. We first discuss the management of transaction-time and valid-time histories and then extend our approach to bitemporal histories. The approach generalizes naturally to support the temporal management of arbitrary XML documents and queries on their version history.
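
As a small illustration, assuming a hypothetical temporally grouped encoding with vt_start/vt_end attributes (the actual XBiT representation and its XQuery queries may differ), a valid-time timeslice can be computed as follows.

```python
# Hypothetical sketch of a temporally grouped XML history and a timeslice
# query in Python; XBiT itself queries such representations with XQuery.
import xml.etree.ElementTree as ET

HISTORY = """
<employee id="e1">
  <salary vt_start="2001-01-01" vt_end="2002-06-30">50000</salary>
  <salary vt_start="2002-07-01" vt_end="9999-12-31">58000</salary>
  <title  vt_start="2001-01-01" vt_end="9999-12-31">Engineer</title>
</employee>
"""

def timeslice(xml_text, date):
    """Return the attribute values valid on the given date (ISO string)."""
    root = ET.fromstring(xml_text)
    snapshot = {}
    for elem in root:
        if elem.get("vt_start") <= date <= elem.get("vt_end"):
            snapshot[elem.tag] = elem.text
    return snapshot

print(timeslice(HISTORY, "2002-01-15"))  # {'salary': '50000', 'title': 'Engineer'}
```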

Fusheng Wang, Carlo Zaniolo

Industrial Presentations I: Applications

Enterprise Cockpit for Business Operation Management

The area of business operations monitoring and management is rapidly gaining importance both in industry and in academia. This is demonstrated by the large number of performance reporting tools that have been developed. Such tools essentially leverage system monitoring and data warehousing applications to perform online analysis of business operations and produce charts from which users can get a feeling for what is happening in the system. While this provides value, there is still a huge gap between what is available today and what users would ideally like to have: Business analysts tend to think of the way business operations are performed in terms of high-level business processes, which we will call abstract in the following; there is no way today for analysts to draw such abstract processes and use them as a metaphor for analyzing business operations. Defining metrics of interest and reporting against these metrics requires a significant coding effort; no system provides, out of the box, facilities for easily defining metrics over process execution data, for providing users with explanations of why a metric has a certain value, and for predicting the future value of a metric. There is no automated support for identifying optimal configurations of the business processes to improve critical metrics. There is no support for understanding the business impact of system failures.

Fabio Casati, Malu Castellanos, Ming-Chien Shan
Modeling Autonomous Catalog for Electronic Commerce

The catalog function is an essential feature in B2C and B2B e-commerce. While a catalog is primarily for end users to navigate and search for products of interest, other e-commerce functions such as merchandising, ordering, inventory and aftermarket constantly refer to information stored in the catalog [1]. The billion-dollar mail order business was created around catalogs long before e-commerce. More opportunities surface once catalog content previously created on paper is digitized. While a catalog is recognized as a necessity for a successful web store, its content structure varies greatly across industries and also within each industry. Product categories, attributes, measurements, languages, and currencies all contribute to the wide variations, which create a difficult dilemma for catalog designers.

Yuan-Chi Chang, Vamsavardhana R. Chillakuru, Min Wang
GiSA: A Grid System for Genome Sequences Assembly

Sequencing genomes is a fundamental aspect of biological research. Shotgun sequencing, since its introduction by Sanger et al. [2], has remained the mainstay of genome sequence assembly. This method randomly obtains sequence reads (e.g., subsequences of about 500 characters) from a genome and then assembles them into contigs based on significant overlap among them. The whole-genome shotgun (WGS) approach generates sequence reads directly from a whole-genome library and uses computational techniques to reassemble them. A variety of assembly programs have been previously proposed and implemented, including PHRAP [3] (Green, 1994), CAP3 [4] (1999), and Celera [5] (2000). Because of the great computational complexity and the increasingly large size of sequence data, these programs incur substantial time and space overhead. PHRAP [3], for instance, which can only run in a stand-alone fashion, requires many times as much memory (usually more than ten times) as the size of the original sequence data. In realistic applications, the assembly process can become unacceptably slow due to insufficient memory, even on a mainframe with a very large RAM.
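
Purely as a toy illustration of the overlap-and-merge idea behind shotgun assembly (not the GiSA or PHRAP algorithms, and ignoring sequencing errors, reverse complements, and repeats), a greedy assembler can be sketched as follows.

```python
# Toy illustration of overlap-based assembly: greedily merge the pair of reads
# with the largest suffix/prefix overlap until only contigs remain.
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (>= min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def assemble(reads):
    reads = list(reads)
    while True:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:                        # no significant overlap left
            return reads                  # remaining strings are the contigs
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]

print(assemble(["ACGTAC", "TACGGA", "GGATTT"]))   # ['ACGTACGGATTT']
```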

Jun Tang, Dong Huang, Chen Wang, Wei Wang, Baile Shi

Industrial Presentations II: Ontology in Applications

Analytical View of Business Data: An Example

This paper describes an example of how the Analytical View (AV) in Microsoft Business Framework (MBF) works. AV consists of three components: the design-time Model Service, the Business Intelligence Entity (BIE) programming model, and the runtime Intelli-Drill for navigation between OLTP and OLAP data sources. Model Service transforms an "object model (transactional view)" into a "multi-dimensional model (analytical view)." It infers dimensionality from the object layer, where richer metadata is stored, eliminating the guesswork that a traditional data warehousing process requires. Model Service also generates BI Entity classes that enable a consistent object-oriented programming model with strong types and rich semantics for OLAP data. Intelli-Drill links together all the information in MBF using metadata, making information navigation in MBF fully discoverable.

Adam Yeh, Jonathan Tang, Youxuan Jin, Sam Skrivan
Ontological Approaches to Enterprise Applications

One of the main challenges in building enterprise applications has been to balance general functionality against domain/scenario-specific customization. The lack of formal ways to extract, distill, and standardize the embedded domain knowledge has been a barrier to minimizing the cost of customization. Using ontologies, as many would hope, will give application builders the much-needed methodology and standards to achieve the objective of building flexible enterprise solutions [1, 2]. However, even with a rich amount of research and quite a few excellent results on designing and building ontologies [3, 4], there are still gaps to be filled for actual deployment of the technology and concept in a real-life commercial environment. The problems are especially hard in applications that require well-defined semantics in mission-critical operations. In this presentation, we introduce two of our projects where ontological approaches are used for enterprise applications. Based on these experiences we discuss the challenges in applying ontology-based technologies to real business applications.

Dongkyu Kim, Yuan-Chi Chang, Juhnyoung Lee, Sang-goo Lee
FASTAXON: A System for FAST (and Faceted) TAXONomy Design

Building very big taxonomies is a laborious task vulnerable to errors and management/scalability deficiencies. FASTAXON is a system for building very big taxonomies in a quick, flexible and scalable manner that is based on the faceted classification paradigm [4] and the Compound Term Composition Algebra [5]. We sketch the architecture and the functioning of this system and we report our experiences from using this system in real applications.
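
As a hypothetical illustration of the faceted paradigm (not the Compound Term Composition Algebra itself), the sketch below enumerates compound terms as one term per facet and filters out combinations declared invalid.

```python
# Hypothetical sketch of faceted taxonomy construction: compound terms are
# combinations of one term per facet, minus combinations declared invalid.
from itertools import product

facets = {
    "Sport":    ["SeaSki", "Windsurfing", "Hiking"],
    "Location": ["Coast", "Mountains"],
}
invalid = {("SeaSki", "Mountains"), ("Windsurfing", "Mountains")}

compound_terms = [combo for combo in product(*facets.values())
                  if combo not in invalid]
for term in compound_terms:
    print(" / ".join(term))
```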

Yannis Tzitzikas, Raimo Launonen, Mika Hakkarainen, Pekka Korhonen, Tero Leppänen, Esko Simpanen, Hannu Törnroos, Pekka Uusitalo, Pentti Vänskä
CLOVE: A Framework to Design Ontology Views

The management and exchange of knowledge on the Internet has become the cornerstone of technological and commercial progress. In this fast-paced environment, the competitive advantage belongs to those businesses and individuals that can leverage the unprecedented richness of web information to define business partnerships, to reach potential customers and to accommodate the needs of these customers promptly and flexibly. The Semantic Web vision is to provide a standard information infrastructure that will enable intelligent applications to automatically or semi-automatically carry out the publication, searching, and integration of information on the Web. This is to be accomplished by semantically annotating data and by using standard inferencing mechanisms on this data. Such annotation would allow applications to understand, say, dates and time intervals regardless of their syntactic representation. For example, in the e-business context, an online catalog application could include the expected delivery date of a product based on the schedules of the supplier, the shipping times of the delivery company and the address of the customer. The infrastructure envisioned by the Semantic Web would guarantee that this can be done automatically by integrating the information of the online catalog, the supplier and the delivery company. No changes to the online catalog application would be necessary when suppliers and delivery companies change, and no syntactic mapping of metadata would be necessary between the three data repositories. To accomplish this, two things are necessary: (1) the data structures must be rich enough to represent the complex semantics of products and services and the various ways in which these can be organized; and (2) there must be flexible customization mechanisms that enable multiple customers to view and integrate these products and services with their own categories. Ontologies are the answer to the former; ontology views are the key to the latter.

Rosario Uceda-Sosa, Cindy X. Chen, Kajal T. Claypool

Demos and Posters

iRM: An OMG MOF Based Repository System with Querying Capabilities

In this work we present iRM, an OMG MOF-compliant repository system that acts as a custom-defined application or system catalogue. iRM enforces structural integrity using a novel approach and provides declarative querying support. It finds use in evolving data-intensive applications and in fields where the integration of heterogeneous models is needed.

Ilia Petrov, Stefan Jablonski, Marc Holze, Gabor Nemes, Marcus Schneider
Visual Querying for the Semantic Web

This paper presents a demonstration of visXcerpt [BBS03,BBSW03], a visual query language for both standard Web and Semantic Web applications.

Sacha Berger, François Bry, Christoph Wieser
Query Refinement by Relevance Feedback in an XML Retrieval System

In recent years, ranked retrieval systems for heterogeneous XML data with both structural search conditions and keyword conditions have been developed for digital libraries, federations of scientific data repositories, and hopefully portions of the ultimate Web. These systems, such as XXL [2], are based on pre-defined similarity measures for atomic conditions (using index structures on contents, paths and ontological relationships) and then use rank aggregation techniques to produce ranked result lists. An ontology can play a positive role in term expansion [2], improving average precision and recall in the INEX 2003 benchmark [3]. Because users lack information on the structure and terminology of the underlying diverse data sources, and because of the complexity of the (powerful) query language, users often cannot avoid posing overly broad or overly narrow initial queries, thus getting either too many or too few results. For the user, it is more appropriate and easier to provide relevance judgments on the best results of an initial query execution, and then to refine the query, either interactively or automatically by the system. This calls for applying relevance feedback technology in the new area of XML retrieval [1]. The key question is how to appropriately generate a refined query based on a user's feedback in order to obtain more relevant results in the top-k result list. Our demonstration shows an approach for extracting user information needs through relevance feedback, maintaining more intelligent personal ontologies, clarifying uncertainties, re-weighting atomic conditions, expanding the query, and automatically generating a refined query for the XML retrieval system XXL.
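
As a hypothetical sketch of just one ingredient of such refinement, re-weighting atomic conditions from relevance judgments, the following code adjusts and renormalizes condition weights; the actual XXL refinement also exploits ontologies and query expansion, which are not modeled here.

```python
# Hypothetical sketch of feedback-driven re-weighting of a query's atomic
# conditions; condition names and the update rule are illustrative only.
def reweight(weights, judged_results, learning_rate=0.2):
    """weights: condition -> weight.
    judged_results: list of (satisfied_conditions, relevant?) pairs taken
    from the user's judgments on the top-k list of the initial query."""
    new = dict(weights)
    for satisfied, relevant in judged_results:
        for cond in satisfied:
            new[cond] = new.get(cond, 0.0) + (learning_rate if relevant else -learning_rate)
    total = sum(max(w, 0.0) for w in new.values()) or 1.0
    return {c: max(w, 0.0) / total for c, w in new.items()}

weights = {"path ~ //article//section": 0.5, "content ~ 'XML retrieval'": 0.5}
feedback = [({"content ~ 'XML retrieval'"}, True),
            ({"path ~ //article//section"}, False)]
print(reweight(weights, feedback))
```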

Hanglin Pan, Anja Theobald, Ralf Schenkel
Semantics Modeling for Spatiotemporal Databases

How to model spatiotemporal changes is one of the key issues in research on spatiotemporal databases. Due to the inefficiency of previous spatiotemporal data models [1, 2], none of them has been widely accepted so far. This paper investigates the types of spatiotemporal changes and an approach to describing them. The semantics of spatiotemporal changes are studied and a systematic classification of spatiotemporal changes is proposed, based on which a framework for a spatiotemporal semantic model is presented.

Peiquan Jin, Lihua Yue, Yuchang Gong
Temporal Information Management Using XML

A closer integration of XML and database systems is actively pursued by researchers and vendors because of the many practical benefits it offers. Additional special benefits can be achieved in temporal information management, an important application area that represents an unsolved challenge for relational databases [1]. Indeed, the XML data model and query languages support (i) temporally grouped representations, which have long been recognized as a natural data model for historical information [2], and (ii) Turing-complete query languages, such as XQuery [3], in which all the constructs needed for temporal queries can be introduced as user-defined libraries, without requiring extensions to existing standards. By contrast, the flat relational tables of traditional DBMSs are not well suited for temporally grouped representations [4]; moreover, significant extensions are required to support temporal information in SQL and, in the past, they were poorly received by SQL standard committees. We will show that (i) the XML hierarchical structure can naturally represent the history of databases and XML documents via temporally grouped data models, and (ii) powerful temporal queries can be expressed in XQuery without requiring any extension to current standards. This approach is quite general and, in addition to the evolution history of databases, it can be used to support the version history of XML documents for transaction-time, valid-time, and bitemporal chronicles [5]. We will demo the queries discussed in [5] and show that this approach leads to simple programming environments that are fully integrated with current XML tools and commercial DBMSs.
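
As a small illustration of temporal grouping, assuming a flat tuple-timestamped history as input (names and encoding are hypothetical, not the demo's exact format), the sketch below groups the history by entity and attribute into an XML document of the kind that XQuery can then query.

```python
# Hypothetical sketch of "temporal grouping": turning flat, tuple-timestamped
# history rows into an element-per-attribute-history XML document.
import xml.etree.ElementTree as ET
from itertools import groupby

rows = [  # (employee, attribute, value, start, end) -- flat history tuples
    ("e1", "salary", "50000", "2001-01-01", "2002-06-30"),
    ("e1", "salary", "58000", "2002-07-01", "9999-12-31"),
    ("e1", "title",  "Engineer", "2001-01-01", "9999-12-31"),
]

def group_history(rows):
    root = ET.Element("employees")
    for emp, emp_rows in groupby(sorted(rows), key=lambda r: r[0]):
        e = ET.SubElement(root, "employee", id=emp)
        for _, attr, value, start, end in emp_rows:
            v = ET.SubElement(e, attr, start=start, end=end)
            v.text = value
    return ET.tostring(root, encoding="unicode")

print(group_history(rows))
```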

Fusheng Wang, Xin Zhou, Carlo Zaniolo
SVMgr: A Tool for the Management of Schema Versioning

The SVMgr tool is an integrated development environment for the management of a relational database supporting schema versioning, based on the multi-pool implementation solution [2]. In short, the multi-pool solution allows the extensional data connected to each schema version (data pool) to evolve independently of the others. The multi-pool solution is more flexible and potentially useful for advanced applications, as it allows the coexistence of different full-fledged conceptual viewpoints on the mini-world modeled by the database [5], and it has also been partially adopted by other authors [3]. The multi-pool implementation underlying the SVMgr tool is based on the Logical Storage Model presented in [4] and allows the underlying multi-version database to be implemented on top of MS Access. The software prototype has been written in Java (it is backward compatible with version 1.2) and interacts with the underlying database via JDBC/ODBC on an MS Windows platform.

Fabio Grandi
GENNERE: A Generic Epidemiological Network for Nephrology and Rheumatology

GENNERE is a networked information system designed to answer epidemiological needs. Based on a French experiment in the field of End-Stage Renal Disease (ESRD), it has been designed to be adaptable to Chinese medical needs and administrative rules. It has been implemented for nephrology and rheumatology at the Rui Jin hospital in Shanghai, but its design and implementation have been guided by genericity in order to ease its adaptation and extension to other diseases and other countries. The genericity aspects have been considered at the levels of event design, database design and production, and software design. This first experiment in China leads to some conclusions about the adaptability of the system to several diseases and about the multilinguality of the interface and of the medical terminologies.

Ana Simonet, Michel Simonet, Cyr-Gabin Bassolet, Sylvain Ferriol, Cédric Gueydan, Rémi Patriarche, Haijin Yu, Ping Hao, Yi Liu, Wen Zhang, Nan Chen, Michel Forêt, Philippe Gaudin, Georges De Moor, Geert Thienpont, Mohamed Ben Saïd, Paul Landais, Didier Guillon

Panel

Panel: Beyond Webservices – Conceptual Modelling for Service Oriented Architectures

Webservices are evolving into the paradigm for loosely coupled architectures. The prospect of automatically composing complex processes from simple services is promising. However, a number of open issues remain: Which aspects of service semantics need to be made explicit? Does it suffice to model only data structures and interfaces, or do we also need process descriptions, behavioral semantics, and quality-of-service specifications? How can we deal with heterogeneous service descriptions? Should we use shared ontologies or ad hoc mappings? This panel will discuss to what extent established techniques from conceptual modelling can help in describing services to enable their discovery, selection, composition, negotiation, and invocation.

Peter Fankhauser
Backmatter
Metadata
Title: Conceptual Modeling – ER 2004
Edited by: Paolo Atzeni, Wesley Chu, Hongjun Lu, Shuigeng Zhou, Tok-Wang Ling
Copyright year: 2004
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-540-30464-7
Print ISBN: 978-3-540-23723-5
DOI: https://doi.org/10.1007/b101693