
2009 | Book

Advances in Data Management

Editors: Zbigniew W. Ras, Agnieszka Dardzinska

Publisher: Springer Berlin Heidelberg

Book Series: Studies in Computational Intelligence


About this book

Data management is the process of planning, coordinating and controlling data resources. Increasingly, applications need to store and search large amounts of data. Data management has been continuously challenged by demands from various areas and applications, and has evolved in parallel with advances in hardware and computing techniques.

This volume focuses on recent advances in data management and is composed of five parts and a total of eighteen chapters. The first part of the book contains five contributions in the area of information retrieval and Web intelligence: a novel approach to solving the index selection problem, integrated retrieval from the Web of documents and data, bipolarity in database querying, deriving data summarization through ontologies, and granular computing for Web intelligence. The second part of the book contains four contributions in the area of knowledge discovery. Its third part contains three contributions in the area of information integration and data security. The remaining two parts of the book contain six contributions in the areas of intelligent agents and applications of data management in the medical domain.

Table of Contents

Frontmatter

Information Retrieval and Web Intelligence

Frontmatter
Automatic Index Selection in RDBMS by Exploring Query Execution Plan Space
Abstract
A novel approach to solving the Index Selection Problem (ISP) is presented. In contrast to other known ISP approaches, our method searches the space of possible query execution plans instead of the space of index configurations. An evolutionary algorithm is used for searching. The solution is obtained indirectly as the set of indexes used by the best query execution plans. The method has important features over other known algorithms: (1) it converges to the optimal solution, unlike greedy heuristics, which for performance reasons tend to reduce the space of candidate solutions, possibly discarding optimal ones; (2) though the search space is huge and grows exponentially with the size of the input workload, searching the space of query plans allows directing more computational power to the most costly plans, thus yielding very fast convergence to “good enough” solutions; and (3) the costly reoptimization of the workload is not needed for calculating the objective function, so several thousand candidates can be checked per second. The algorithm was tested on large synthetic and real-world SQL workloads to evaluate its performance and scalability.
Piotr Kołaczkowski, Henryk Rybiński
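To make the plan-space idea concrete, here is a minimal Python sketch of searching plans rather than index configurations, with a (1+1)-style mutation loop standing in for the chapter's full evolutionary algorithm; the cost model, `candidate_indexes`, and all names are illustrative assumptions, not the authors' implementation:

```python
import random

# Hypothetical toy model: a "plan" assigns each query an access path,
# either a full scan (None) or one index usable by that query.
# The ISP solution is read off the best plan, not searched directly.

def plan_cost(plan, scan_cost, index_cost):
    """Cost of the workload under a plan: sum of per-query access costs."""
    return sum(index_cost[q][idx] if idx is not None else scan_cost[q]
               for q, idx in plan.items())

def mutate(plan, candidate_indexes):
    """Mutation: re-route one random query through another access path."""
    q = random.choice(list(plan))
    child = dict(plan)
    child[q] = random.choice([None] + candidate_indexes[q])
    return child

def evolve(queries, candidate_indexes, scan_cost, index_cost, steps=10_000):
    best = {q: None for q in queries}        # start with full scans only
    for _ in range(steps):
        child = mutate(best, candidate_indexes)
        if plan_cost(child, scan_cost, index_cost) <= plan_cost(best, scan_cost, index_cost):
            best = child
    # The recommended index configuration is whatever the best plan uses.
    return {idx for idx in best.values() if idx is not None}
```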
Integrated Retrieval from Web of Documents and Data
Abstract
The Semantic Web is evolving into a property-linked web of data, conceptually different from but contained in the Web of hyperlinked documents. Data retrieval techniques are typically used to retrieve data from the Semantic Web, while information retrieval techniques are used to retrieve documents from the hypertext Web. We present a Unified Web model that integrates the two webs and formalizes the connection between them. We then present an approach to retrieving documents and data that captures the best of both worlds. Specifically, it improves recall for legacy documents and provides keyword-based search capability for the Semantic Web. We specify the Hybrid Query Language that embodies this approach, and the prototype system SITAR that implements it. We conclude with areas of future work.
Krishnaprasad Thirunarayan, Trivikram Immaneni
Bipolar Queries: A Way to Enhance the Flexibility of Database Queries
Abstract
In many real-life scenarios the use of standard query languages may be ineffective due to the difficulty of expressing the real user requirements (information needs). The use of fuzzy logic helps to overcome this ineffectiveness by making it possible to model and properly process linguistic terms in queries. This way a user may express his or her requirements in a more intuitive and flexible way. Recently, another dimension of such flexibility has attracted the attention of many researchers: it is now widely advocated that, when specifying his or her requirements, the user usually has in mind both negative and positive preferences. Thus, combining the intuitive appeal of natural-language terms in queries with the bipolar nature of preferences seems a promising next step in enhancing the flexibility of queries. We look at various ways to understand bipolarity in database querying, propose fuzzy counterparts of some crisp approaches, and study their properties.
Sławomir Zadrożny, Janusz Kacprzyk
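One concrete reading of bipolarity studied in this line of work is the "and possibly" combination of a required condition C and a desired condition P, where P is enforced only to the degree it is attainable among the answers. A minimal fuzzy sketch (the membership functions and data are illustrative assumptions):

```python
def and_possibly(mu_c, mu_p, items):
    """Fuzzy 'C and possibly P' (a fuzzified Lacroix-Lavency reading):
    the desired condition P counts against an item only to the degree
    that C and P are jointly attainable in the answer set."""
    attainable = max(min(mu_c(t), mu_p(t)) for t in items)
    return {t: min(mu_c(t), max(mu_p(t), 1.0 - attainable)) for t in items}

# Illustrative example: houses as (price, distance to work in km).
houses = [(200_000, 2.0), (150_000, 15.0), (250_000, 1.0)]
cheap = lambda h: max(0.0, min(1.0, (300_000 - h[0]) / 150_000))  # required C
close = lambda h: max(0.0, min(1.0, (10.0 - h[1]) / 10.0))        # desired P

for house, degree in and_possibly(cheap, close, houses).items():
    print(house, round(degree, 2))
```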
On Deriving Data Summarization through Ontologies to Meet User Preferences
Abstract
A summary is a comprehensive description that grasps the essence of a subject. A text, a collection of text documents, or a query answer can be summarized by simple means, such as an automatically generated list of the most frequent words, or by “advanced” means, such as a meaningful textual description of the subject. In between these two extremes are summaries by means of selected key concepts drawn from background knowledge. We address in this chapter an approach where conceptual summaries are provided through a conceptualization as given by an ontology. The idea is to restrict a background ontology to the set of concepts that appear in the text to be summarized and thereby provide a structure, a so-called instantiated ontology, that is specific to the domain of the text and can be used to condense the text into a summary that covers the subject not only quantitatively but also conceptually. In this chapter we introduce different approaches to summarization: a strictly ontology-based approach where summaries are derived solely from the instantiated ontology, a conceptual clustering over the instantiated concepts based on a semantic similarity measure, and an approach based on probabilities.
Troels Andreasen, Henrik Bulskov
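The restriction step, keeping only the concepts instantiated in the text together with their ancestors, can be sketched as follows; the toy is-a ontology and whitespace tokenizer are illustrative assumptions, and the chapter's similarity-based clustering and probabilistic variants are not shown:

```python
# Toy is-a ontology: child -> parent (None marks the root). Illustrative only.
ONTOLOGY = {"poodle": "dog", "dog": "mammal", "cat": "mammal",
            "mammal": "animal", "animal": None}

def instantiated_ontology(text):
    """Restrict the background ontology to concepts appearing in the text,
    closed upward under is-a, so summaries can generalize over them."""
    found = {w for w in text.lower().split() if w in ONTOLOGY}
    closure = set()
    for concept in found:
        while concept is not None:       # walk up to the root
            closure.add(concept)
            concept = ONTOLOGY[concept]
    # Keep only the edges between surviving concepts.
    return {c: p for c, p in ONTOLOGY.items() if c in closure}

print(instantiated_ontology("the poodle chased a cat"))
# {'poodle': 'dog', 'dog': 'mammal', 'cat': 'mammal',
#  'mammal': 'animal', 'animal': None}
```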
Granular Computing for Web Intelligence
Abstract
The World Wide Web, or simply the Web, is a large-scale, complex system that humans have created in recent years. The Web brings opportunities and challenges to academic and industry communities, and indeed to almost everyone on this planet. Due to its huge scale and complexity, it is impossible to find simple theories and models that explain the Web; instead, more sophisticated theories and methodologies are needed so that the Web can be examined from various perspectives. This chapter has two purposes: one is to present an overview of the triarchic theory of granular computing, and the other is to examine granular computing perspectives on Web Intelligence (WI).
Yiyu Yao, Ning Zhong

Knowledge Discovery

Frontmatter
Visualizing High Dimensional Classifier Performance Data
Abstract
Classifier performance evaluation, which typically yields a vast number of results, may be approached as a problem of analyzing high-dimensional data. Conducting an exploratory analysis of visual representations of this evaluation data lets us exploit powerful human visual capabilities, gain insight into the performance data, interact with it, and draw meaningful conclusions about the classifiers and domains under study. We illustrate how visual techniques, based on a projection from a high-dimensional space to a lower-dimensional one, enable such an exploratory process. Moreover, this approach can be viewed as a generalization of conventional evaluation procedures based on point metrics, which necessarily imply a higher loss of information. Finally, we show that within this framework the user is able to study the evaluation data from a classifier point of view and from a domain point of view, which is infeasible with traditional evaluation methods.
Rocio Alaiz-Rodríguez, Nathalie Japkowicz, Peter Tischer
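The underlying projection idea, treating each classifier's collection of performance measurements as a point in a high-dimensional space and mapping it to 2-D for visual exploration, can be sketched with a plain SVD-based PCA; the data and the choice of PCA are illustrative assumptions, not necessarily the projection used in the chapter:

```python
import numpy as np

def pca_2d(X):
    """Project rows of X (one classifier per row, one performance
    measurement per column) onto the two principal components."""
    Xc = X - X.mean(axis=0)                        # center each measurement
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                           # 2-D coordinates per row

# Illustrative: 4 classifiers x 6 performance measurements.
perf = np.array([[0.91, 0.88, 0.70, 0.93, 0.85, 0.79],
                 [0.89, 0.90, 0.72, 0.91, 0.84, 0.81],
                 [0.60, 0.65, 0.95, 0.58, 0.70, 0.92],
                 [0.62, 0.63, 0.93, 0.60, 0.72, 0.90]])
print(pca_2d(perf))   # similar classifiers land close together in the plot
```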
Extending Rule-Based Classifiers to Improve Recognition of Imbalanced Classes
Abstract
Knowledge discovery in general, and data mining in particular, have received growing interest from both research and industry in recent years. Their main aim is to look for previously unknown relationships or patterns representing knowledge hidden in real-life data sets [16]. Typical representations of knowledge discovered from data are associations, trees or rules, relational logic clauses, functions, clusters or taxonomies, and characteristic descriptions of concepts [16, 29, 21]. In this paper we focus on the rule-based representation. More precisely, we are interested in decision or classification rules as considered in classification problems. In data mining other types of rules are also studied, e.g., association rules or action rules [16, 29, 34]; however, in the text hereafter we use the general term “rules” to refer specifically to decision rules.
Jerzy Stefanowski, Szymon Wilk
Converting between Various Sequence Representations
Abstract
This chapter is concerned with the organization of categorical sequence data. We first build a typology of sequences, distinguishing for example between chronological sequences and sequences without time content. This permits us to identify the kind of information that the data organization should preserve. Focusing then mainly on chronological sequences, we discuss the advantages and limits of different ways of representing time-stamped event and state sequence data, and present solutions for automatically converting between various formats, e.g., between horizontal and vertical representations, but also from state sequences into event sequences and vice versa. Special attention is also paid to the handling of missing values in these conversion processes.
Gilbert Ritschard, Alexis Gabadinho, Matthias Studer, Nicolas S. Müller
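One of the conversions discussed, between state sequences and event sequences, amounts to emitting a transition event at each state change and replaying such events to recover the states. A minimal sketch (the event naming scheme is an illustrative assumption):

```python
def states_to_events(states):
    """Convert a state sequence into (time, event) pairs: one 'enter'
    event, then one transition event per state change."""
    events = [(0, f"enter:{states[0]}")]
    for t in range(1, len(states)):
        if states[t] != states[t - 1]:
            events.append((t, f"{states[t-1]}->{states[t]}"))
    return events

def events_to_states(events, length):
    """Inverse conversion: replay the transition events back into a
    state sequence of the given length."""
    state = events[0][1].split(":", 1)[1]
    states, i = [], 0
    for t in range(length):
        while i + 1 < len(events) and events[i + 1][0] == t:
            i += 1
            state = events[i][1].split("->")[1]
        states.append(state)
    return states

seq = ["school", "school", "work", "work", "unemployed", "work"]
ev = states_to_events(seq)   # [(0, 'enter:school'), (2, 'school->work'), ...]
assert events_to_states(ev, len(seq)) == seq
```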
Considerations on Logical Calculi for Dealing with Knowledge in Data Mining
Summary
An attempt to develop and apply logical calculi in exploratory data analysis was made 30 years ago. It resulted in the definition and study of observational logical calculi based on modifications of classical predicate calculi and on mathematical statistics. Additional results followed the definition and first implementations of the GUHA method of mechanizing hypothesis formation. The GUHA method can be seen as one of the first data mining methods. Applications of a modern, enhanced implementation of the GUHA method confirmed the generally accepted need to use domain knowledge in the process of data mining. Moreover, they inspired considerations on the application of logical calculi for dealing with domain knowledge in data mining. This paper presents these considerations.
Jan Rauch

Information Integration and Data Security

Frontmatter
A Study on Recent Trends on Integration of Security Mechanisms
Abstract
Business solutions and security solutions are designed by different authorities at different coordinates of space and time. This engineering approach not only makes the lives of security and business solution developers easier but also provides a proof of concept that the business solution will have all the expected security features. It does not, however, provide a proof that the integration process will not lead to conflicts between security features within the security solution, or between security features and the functional features of the business solution. To provide a conflict-free secured business solution, the developers of both the security solution and the business solution need a mechanism to identify all possible cases of conflicts, so that they can redesign the corresponding solutions and resolve the conflicts, if any. Conflicts arise from the different authorities involved and from configuration and other resource sharing among the solutions under integration. In this chapter, we discuss conflicts arising during the integration of security solutions with business solutions, covering the wide spectrum of social, socio-technical, and purely technical perspectives. Recent approaches for automated detection of conflicts are also discussed briefly. The ultimate objective of the chapter is to identify the approaches best suited for software developers to detect conflicts. It spans approaches from the cryptographic level to the policy level, drawing on the feature interaction problem typical of software systems. The assessment of these approaches is demonstrated on a remote healthcare application.
Paul El Khoury, Mohand-Saïd Hacid, Smriti Kumar Sinha, Emmanuel Coquery
Monitoring-Based Approach for Privacy Data Management
Abstract
This chapter addresses the problem of managing private data in service-based applications while ensuring end-to-end quality of service (QoS). The proposed approach monitors compliance with a privacy agreement that spells out a consumer's privacy rights and how consumer private information must be handled by the service provider. A state machine based model is proposed to describe the Private Data Use Flow (PDUF), which a privacy analyst can use to observe the flow and capture privacy vulnerabilities that may lead to non-compliance. The model is built on top of (i) properties and time-related privacy requirements to be monitored, specified using LTL (Linear Temporal Logic), and (ii) a set of identified privacy misuses.
H. Meziane, S. Benbernou, F. Leymann, M. P. Papazoglou
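The monitoring idea, replaying each use of a private data item against the transitions an agreement permits, can be sketched as a small state machine; the states, actions, and agreement below are illustrative assumptions, not the chapter's PDUF model or its LTL machinery:

```python
# Illustrative private-data-use monitor: the agreement permits only
# certain (state, action) transitions; anything else is flagged.
ALLOWED = {("collected", "store"): "stored",
           ("stored", "use_for_billing"): "stored",
           ("stored", "delete"): "deleted"}

def monitor(trace):
    """Replay a trace of actions on one data item and report the first
    privacy violation, if any."""
    state = "collected"
    for step, action in enumerate(trace):
        nxt = ALLOWED.get((state, action))
        if nxt is None:
            return f"violation at step {step}: '{action}' in state '{state}'"
        state = nxt
    return "compliant"

print(monitor(["store", "use_for_billing", "delete"]))   # compliant
print(monitor(["store", "share_with_partner"]))          # violation
```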
Achieving Scalability with Schema-Less Databases
Abstract
Large enterprises continue to struggle with information and critical decision-making data that is widely distributed, stored in a number of proprietary and heterogeneous formats, and inaccessible for mining the critical information that spans the collected knowledge of the organization. NETMARK is an easy-to-use, scalable system for storing, decomposing, and indexing enterprise-wide information, developed for NASA enterprise applications. Information is managed in a contextualized form, but one that is schema-less, allowing immediate storage and retrieval without the need for a schema manager or database administrator. NETMARK is accessed via the standard WebDAV (HTTP) protocol for remote document management and a simple HTTP query algebra for immediate retrieval of information in an XML-structured format, for processing by applications such as Web 2.0 (AJAX) systems.
David A. Maluf, Christopher D. Knight

Intelligent Agents

Frontmatter
Managing Pervasive Environments through Database Principles: A Survey
Abstract
As initially envisioned by Mark Weiser, pervasive environments are the trend for the future of information systems. Heterogeneous devices, from small sensors to mainframe computers, are all linked through ubiquitous networks ranging from local peer-to-peer wireless connections to the world-wide Internet. Managing such environments, so as to benefit from their full potential of available resources providing information and services, is a challenging issue that covers several research fields, such as data representation, network management, and service discovery. However, some of these issues have already been tackled independently by the database community, e.g., for distributed databases or data integration. In this survey, we analyze current trends in pervasive environment management through database principles and sketch the main components of our ongoing project SoCQ, devoted to bridging the gap between pervasive environments and databases.
Yann Gripay, Frédérique Laforest, Jean-Marc Petit
Toward a Novel Design of Swarm Robots Based on the Dynamic Bayesian Network
Abstract
In this chapter, we describe a novel design method for swarm robots based on the dynamic Bayesian network. Recently, increasing attention has been paid to swarm robots due to their scalability, flexibility, cost-effectiveness, and robustness. Designing swarm robots so that they exhibit intended collective behaviors is considered the most challenging issue, and so far ad hoc methods that rely heavily on extensive experiments are common. Such a method typically faces a huge amount of data and handles it, possibly using machine learning methods such as clustering. We argue, however, that a more principled use of data with a probabilistic model can be expected to reduce the number of experiments in the design, and we propose the fundamental part of such an approach. A simple but real example using two swarm robots is described as an application.
Einoshin Suzuki, Hiroshi Hirai, Shigeru Takano
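The probabilistic core of such a design, maintaining a belief over a robot's hidden behavioral mode from noisy observations via the dynamic Bayesian network's forward update, can be sketched as follows; the two-mode model and its matrices are illustrative assumptions:

```python
import numpy as np

# Illustrative 2-mode DBN for one swarm robot: hidden mode in
# {follow, explore}, noisy sensor reading in {near, far}.
T = np.array([[0.9, 0.1],      # P(mode_t | mode_{t-1}), rows = previous mode
              [0.2, 0.8]])
O = np.array([[0.8, 0.2],      # P(obs | mode), rows = mode, cols = near/far
              [0.3, 0.7]])

def forward(belief, obs):
    """One DBN filtering step: predict with T, correct with O, normalize."""
    predicted = belief @ T
    corrected = predicted * O[:, obs]
    return corrected / corrected.sum()

belief = np.array([0.5, 0.5])
for obs in [0, 0, 1]:          # observations: near, near, far
    belief = forward(belief, obs)
print(belief)                  # posterior over {follow, explore}
```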
Current Research Trends in Possibilistic Logic: Multiple Agent Reasoning, Preference Representation, and Uncertain Databases
Abstract
Possibilistic logic is a weighted logic that handles uncertainty, or preferences, in a qualitative way by associating certainty, or priority, levels with classical logic formulas. Moreover, possibilistic logic copes with inconsistency by taking advantage of the stratification of the set of formulas induced by the associated levels. Since its introduction in the mid-eighties, multiple facets of possibilistic logic have been laid bare and various applications addressed: handling exceptions in default reasoning, modeling belief revision, providing a graphical Bayesian-like network representation counterpart to a possibilistic logic base, representing positive and negative information in a bipolar setting with applications to preference fusion and to version space learning, extending possibilistic logic to deal with time or with multiple agents' mutual beliefs, developing a symbolic treatment of priorities for handling partial orders between levels and improving computational efficiency, and learning stratified hypotheses for coping with exceptions. The chapter aims primarily at offering an introductory survey of possibilistic logic developments, but it also outlines new research trends relevant to preference representation and to reasoning about epistemic states.
Henri Prade
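A central computation behind this inconsistency handling is the inconsistency level of a stratified base: the highest certainty level whose cut is classically inconsistent. A brute-force propositional sketch, with formulas encoded as Python predicates over truth assignments (the base and encoding are illustrative assumptions):

```python
from itertools import product

def alpha_cut_consistent(base, alpha, n_vars):
    """True iff the formulas with level >= alpha have a common model
    (brute force over all assignments of n_vars propositional variables)."""
    cut = [f for f, level in base if level >= alpha]
    return any(all(f(v) for f in cut)
               for v in product([False, True], repeat=n_vars))

def inconsistency_level(base, n_vars):
    """Inc(B) = max{alpha : the alpha-cut of B is inconsistent}, 0 if none.
    Cuts only grow as alpha decreases, so scan levels from highest down."""
    for alpha in sorted({lvl for _, lvl in base}, reverse=True):
        if not alpha_cut_consistent(base, alpha, n_vars):
            return alpha
    return 0.0

# Illustrative base over variables (p, q): pairs (formula, certainty level).
B = [(lambda v: v[0], 0.8),                 # (p, 0.8)
     (lambda v: not v[0] or v[1], 0.6),     # (p -> q, 0.6)
     (lambda v: not v[1], 0.5)]             # (not q, 0.5)
print(inconsistency_level(B, 2))            # 0.5: {p, p->q, not q} clashes
```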

Data Management in Medical Domain

Frontmatter
Atherosclerosis Risk Assessment Using Rule-Based Approach
Abstract
A number of calculators that compute the risk of atherosclerosis have been developed and made available on the Internet. They are all based on computing a weighted sum of risk factors. We propose instead to use a more flexible rule-based approach to estimate this risk. The rules used were created using machine learning methods and further refined by a domain expert. Using our rule-based expert system NEST, we built a consultation module, AtherEx, that helps a non-expert user evaluate his or her atherosclerosis risk via the Internet.
Petr Berka, Marie Tomečková
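The contrast the chapter draws, a weighted sum of risk factors versus rules that can fire on combinations of factors, can be sketched as follows; the factors, weights, and rules are illustrative assumptions, not AtherEx's actual knowledge base:

```python
# Illustrative patient record.
patient = {"age": 58, "smoker": True, "systolic_bp": 150, "bmi": 31}

def weighted_sum_risk(p):
    """Classical calculator style: risk as a weighted sum of factors."""
    return (0.02 * p["age"] + 0.8 * p["smoker"]
            + 0.01 * p["systolic_bp"] + 0.03 * p["bmi"])

def rule_based_risk(p):
    """Rule-based style: rules can fire on combinations of factors,
    which a plain weighted sum cannot express."""
    rules = [
        (lambda p: p["smoker"] and p["systolic_bp"] > 140, "high"),
        (lambda p: p["age"] > 55 and p["bmi"] > 30,        "elevated"),
        (lambda p: True,                                    "baseline"),
    ]
    for condition, risk in rules:
        if condition(p):
            return risk            # first matching rule decides

print(weighted_sum_risk(patient), rule_based_risk(patient))
```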
Interpretation of Imprecision in Medical Data
Abstract
Imprecision is an intrinsic part of all data types, and even more so of medical data. In this paper, we revisit the definition of imprecision as well as the closely related concepts of incompleteness, uncertainty, inaccuracy, and, in general, imperfection of data. We examine the traditional hierarchical approach to data, information, and knowledge in the context of medical data, which is characterized by heterogeneity, variable granularity, and time dependency. We observe that (1) imprecision has syntactic, semantic, and pragmatic aspects and (2) imprecision spans a spectrum from most precise to most imprecise and unknown. We argue that the interpretation of imprecision is highly contextual and, furthermore, that medical data cannot be decoupled from their meanings and their intended usage. To address the contextual interpretation of imprecision, we present a framework for knowledge-based modeling of medical data, which comprises a semiotic approach, a fuzzy-logic approach, and a multidimensional approach.
Mila Kwiatkowska, Peter Riben, Krzysztof Kielan
Promoting Diversity in Top Hits for Biomedical Passage Retrieval
Abstract
With the volume of biomedical literature, such as BMC or PubMed, exploding, it is of paramount importance to have scalable passage retrieval systems that allow researchers to quickly find desired information. While topical relevance is the most important factor in biomedical text retrieval, an effective retrieval system also needs to cover diverse aspects of the topic. Aspect-level performance means that the top-ranked passages for a topic should cover diverse aspects. Aspect-level retrieval methods often involve clustering the retrieved passages on the basis of textual similarity. We propose the HIERDENC text retrieval system, which ranks the retrieved passages, achieving scalability and improved aspect-level performance over other clustering methods. HIERDENC runtimes scale to large datasets, such as PubMed and BMC. Its aspect-level performance is consistently better than that of cosine similarity and Hamming distance-based clustering methods. HIERDENC is comparable to biclustering for separating relevant passages, and improves on topics where many aspects are involved. Converting textual passages to GO/MeSH ontological terms further improves HIERDENC's aspect-level performance.
Bill Andreopoulos, Xiangji Huang, Aijun An, Dirk Labudde, Qinmin Hu
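HIERDENC itself is a clustering-based method; as a generic illustration of the aspect-level goal (not the authors' algorithm), a maximal-marginal-relevance re-ranker shows how each next passage can be chosen to be relevant yet dissimilar from those already ranked:

```python
def mmr_rerank(passages, relevance, similarity, lam=0.7, k=5):
    """Greedy diversity-aware re-ranking (MMR, Carbonell & Goldstein 1998):
    trade off relevance against redundancy with already-selected passages."""
    selected, pool = [], list(passages)
    while pool and len(selected) < k:
        best = max(pool, key=lambda p: lam * relevance[p]
                   - (1 - lam) * max((similarity(p, s) for s in selected),
                                     default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected

# Illustrative toy data: passages as term sets, Jaccard similarity.
docs = {"p1": {"gene", "expression"}, "p2": {"gene", "mutation"},
        "p3": {"gene", "expression", "cancer"}}
rel = {"p1": 0.9, "p2": 0.7, "p3": 0.85}
jac = lambda a, b: len(docs[a] & docs[b]) / len(docs[a] | docs[b])
print(mmr_rerank(docs, rel, jac, k=3))   # ['p1', 'p3', 'p2']
```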
Backmatter
Metadata
Title
Advances in Data Management
Editors
Zbigniew W. Ras
Agnieszka Dardzinska
Copyright Year
2009
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-02190-9
Print ISBN
978-3-642-02189-3
DOI
https://doi.org/10.1007/978-3-642-02190-9
