
2013 | Book

Advances in Databases and Information Systems

Edited by: Tadeusz Morzy, Theo Härder, Robert Wrembel

Publisher: Springer Berlin Heidelberg

Book series: Advances in Intelligent Systems and Computing


About this book

This volume is the second of two containing papers from the 16th East European Conference on Advances in Databases and Information Systems (ADBIS 2012), held on September 18-21, 2012, in Poznań, Poland. The first volume was published in the LNCS series.

This volume includes 27 research contributions, selected out of 90 submissions. The contributions cover a wide spectrum of topics in the database and information systems field, including: database foundations and theory, data modeling and database design, business process modeling, query optimization in relational and object databases, materialized view selection algorithms, index data structures, distributed systems, system and data integration, semi-structured data and databases, semantic data management, information retrieval, data mining techniques, data stream processing, trust and reputation in the Internet, and social networks. The content of this volume thus spans the field from database fundamentals, through still-active research problems (e.g., data mining, XML data processing), to novel research areas (e.g., social networks, trust and reputation, and data stream processing). The editors of this volume believe that its content will inspire researchers with new ideas for future work. It may also serve as an overview of ongoing work in the field of databases and information systems.

Table of contents

Frontmatter
Reduction Relaxation in Privacy Preserving Association Rules Mining
Abstract
In Privacy Preserving Association Rules Mining, when frequent sets are discovered, relaxation can be used to decrease the false negative error component and, in consequence, to decrease the number of true frequent itemsets that are missed. We introduce a new type of relaxation, the reduction relaxation, which enables a miner to decrease and control the false negative error for different lengths of frequent itemsets.
Piotr Andruszkiewicz
Decomposition of Non-deterministic Services for the Purpose of Replication
Abstract
Replication can improve the performance, availability and reliability of services. However, its application raises several design and implementation problems. One of them is the non-determinism of processing on replicas. We propose a "design pattern" for structuring a service so that this problem can be overcome.
Marcin Bazydło, Szymon Francuzik, Cezary Sobaniec, Dariusz Wawrzyniak
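
The abstract leaves the pattern itself to the paper; a common way to structure a service so that replicas stay consistent despite non-determinism is record/replay: the primary executes each non-deterministic call once and ships the outcome, which replicas replay instead of re-executing. A minimal Python sketch of that idea (not necessarily the authors' pattern; names are hypothetical):

```python
import random
import time

class RecordReplayService:
    """Wraps non-deterministic calls so a primary records their
    results and replicas replay them, keeping state identical."""

    def __init__(self, is_primary, log=None):
        self.is_primary = is_primary
        self.log = log if log is not None else []  # shipped via the replication channel
        self.cursor = 0

    def nondet(self, fn, *args):
        if self.is_primary:
            value = fn(*args)          # execute the non-deterministic call once
            self.log.append(value)     # record the outcome for replicas
        else:
            value = self.log[self.cursor]  # replay the primary's outcome
            self.cursor += 1
        return value

    def handle_request(self):
        # Both primary and replica now compute the same state transition.
        token = self.nondet(random.random)
        stamp = self.nondet(time.time)
        return (token, stamp)

primary = RecordReplayService(is_primary=True)
result = primary.handle_request()
replica = RecordReplayService(is_primary=False, log=primary.log)
assert replica.handle_request() == result
```
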
Partitioning Approach to Collocation Pattern Mining in Limited Memory Environment Using Materialized iCPI-Trees
Abstract
Collocation pattern mining is one of the latest data mining techniques applied in Spatial Knowledge Discovery. We consider the problem of executing collocation pattern queries in a limited memory environment. In this paper we introduce a new method based on iCPI-tree materialization and spatial partitioning to efficiently discover collocation patterns. We have implemented this new solution and conducted a series of experiments. The results show a significant improvement in processing times on both synthetic and real-world datasets.
Pawel Boinski, Maciej Zakrzewicz
Towards the Automated Business Model-Driven Conceptual Database Design
Abstract
The paper presents a UML-business-model-driven approach to the automated design of an initial conceptual database model. The proposed approach is based on: (i) extraction of characteristic concepts from the business model, consisting of a finite set of UML activity diagrams representing the business processes of the whole enterprise, and (ii) automated generation of a UML class diagram representing the target conceptual model. The implemented two-phase automatic generator iterates through the source business model, processes all activity diagrams and generates the target class diagram. Classes are generated in the first phase, while class associations are added in the second. The application of the implemented generator is illustrated on a simplified real business model.
Drazen Brdjanin, Slavko Maric
Aligning Business Process Models and Domain Knowledge: A Meta-modeling Approach
Abstract
In recent years the problems related to modeling and improving business processes have been of growing interest. Indeed, companies are realizing the undeniable impact of a better understanding and management of business processes (BP) on the effectiveness, consistency, and transparency of their business operations. BP modeling aims at a better understanding of processes, allowing decision makers to achieve the strategic goals of the company. However, inexperienced systems analysts often lack domain knowledge, and this affects the quality of the models they produce. In this paper we propose to support this modeling effort with an approach that uses domain knowledge to improve the semantic quality of BP models. The approach relies on domain ontologies as a means to capture domain knowledge, and on meta-modeling techniques. The main contribution of this paper is threefold: 1) the meta-models describing both a domain ontology and a BP model are described, 2) the alignment between the concepts of the two meta-models is defined and illustrated, 3) a set of OCL mapping rules is provided. A simple case study illustrates the process.
Samira Si-Said Cherfi, Sarah Ayad, Isabelle Comyn-Wattiau
Optimization of Object-Oriented Queries through Pushing Selections
Abstract
We propose a new optimization method for object-oriented queries. The method enables pushing selection conditions before structure constructors, joins and quantifiers. A less general variant of this method is known from relational systems and SQL as pushing a selection before a join. If a query involves a selection whose predicates refer to proper arguments of a structure constructor, a join or a quantifier occurring in the left-hand subquery of this query, then the query can be rewritten by pushing such predicates down before the proper operator. The approach follows the stack-based approach (SBA) and its query language SBQL (Stack-Based Query Language). The assumed strong typing, based on compile-time simulation of run-time actions, makes it possible to solve this optimization problem in its full generality. The paper presents examples of how the optimization method works. General features of the implemented algorithm are also presented.
Marcin Drozd, Michał Bleja, Krzysztof Stencel, Kazimierz Subieta
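
The relational special case that the abstract cites, pushing a selection before a join, is the classic algebraic commutation; stated in LaTeX:

```latex
\sigma_{p}(R \bowtie S) \;=\; \sigma_{p}(R) \bowtie S
\qquad \text{whenever } \mathrm{attr}(p) \subseteq \mathrm{attr}(R)
```

Filtering R before the join shrinks the join's input; the paper generalizes this commutation to SBQL structure constructors, joins and quantifiers under strong typing.
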
On Parallel Sorting of Data Streams
Abstract
Since the development of applications for parallel architectures is complicated and error-prone, many frameworks have been created to simplify this task. One promising approach, applicable especially to the development of parallel databases, is expressing algorithms as stream programs, i.e. the inputs and outputs of procedures are data streams, and these procedures are connected so that they form an oriented graph. In this paper, we introduce a highly scalable sorting algorithm suitable for streaming systems. We achieve this mainly by introducing a multiway merge algorithm that is able to merge multiple independent sorted streams in parallel.
Zbyněk Falt, Jan Bulánek, Jakub Yaghob
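
As a small illustration of the merge step (sequential here, whereas the paper's contribution is a parallel multiway merge), Python's heapq.merge lazily combines independently sorted runs into a single sorted stream:

```python
import heapq

def sorted_runs(stream, run_length):
    """Split an unsorted stream into sorted runs (the 'sort' phase,
    which independent workers could perform in parallel)."""
    buffer = []
    for item in stream:
        buffer.append(item)
        if len(buffer) == run_length:
            yield sorted(buffer)
            buffer = []
    if buffer:
        yield sorted(buffer)

# 'merge' phase: combine the independent sorted runs into one stream.
runs = list(sorted_runs(iter([5, 3, 8, 1, 9, 2, 7, 4, 6]), run_length=3))
merged = heapq.merge(*runs)
print(list(merged))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```
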
Mining Association Rules from Database Tables with the Instances of Simpson’s Paradox
Abstract
This paper investigates the problem of mining association rules (ARs) from database tables in the case of the occurrence of Simpson's paradox. Firstly, the paper reports that it is impossible to mine reliable association rules using solely objective, data-based evaluation measures. The importance of the problem comes from the fact that in non-experimental environments, e.g. in medicine or economics, Simpson's paradox is likely to occur and difficult to overcome by the controlled acquisition of data. This paper proposes a new approach that exploits supplementary knowledge during the selection of ARs, and thus overcomes the presence of Simpson's paradox. In the experimental part, the paper identifies the problem in exemplary real-world data and shows how the proposed approach can be used in practice.
Wojciech Froelich
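
For readers unfamiliar with the paradox, a minimal worked instance (classic illustrative numbers, not taken from the paper's datasets): rule A shows higher confidence than rule B in each stratum of the data, yet lower confidence on the aggregated table:

```python
# Classic Simpson's paradox figures (success counts / total cases);
# confidence(rule) here is simply successes / cases.
groups = {
    "stratum 1": {"A": (81, 87),   "B": (234, 270)},
    "stratum 2": {"A": (192, 263), "B": (55, 80)},
}

def confidence(success, total):
    return success / total

totals = {"A": [0, 0], "B": [0, 0]}
for name, g in groups.items():
    for rule, (s, n) in g.items():
        totals[rule][0] += s
        totals[rule][1] += n
    print(name, {r: round(confidence(*g[r]), 3) for r in g})
    # A wins in each stratum: 0.931 > 0.867 and 0.73 > 0.688

print("aggregate", {r: round(confidence(*totals[r]), 3) for r in totals})
# ... yet B wins overall: A = 0.78 < B = 0.826
```
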
Synchronization Modeling in Stream Processing
Abstract
Currently used latency models in stream databases are based on the average-value analysis that results from Little's law. Other models apply the theory of the M/G/1 queuing system. These solutions are fast and easy to implement, but they omit the impact of stream synchronization. In this paper, we introduce a heuristic method that measures the synchronization impact. We then use this method to extend the popular model based on average-value analysis. This modification allows us to achieve better accuracy of latency estimation. Because schedulers and stream operator optimization require a fast and accurate model, we consider our model a good starting point for creating better optimizers.
Marcin Gorawski, Aleksander Chrószcz
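
The average-value model the paper extends rests on Little's law, which ties the mean number of items queued at an operator to their arrival rate and mean latency:

```latex
L = \lambda W \qquad\Longrightarrow\qquad W = \frac{L}{\lambda}
```

Here L is the mean number of tuples waiting at an operator and λ their arrival rate, so the mean latency W can be estimated from observable averages; the paper's heuristic adds a correction for the synchronization of multiple input streams.
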
Towards Automated Enterprise Architecture Documentation: Data Quality Aspects of SAP PI
Abstract
Well-executed Enterprise Architecture (EA) management is commonly perceived as a strategic advantage. EA management encourages IT-savvy firms to make well-founded decisions based on mature EA information. As of today, gathering the required information, i.e. documenting the EA, is regarded as both time-consuming and error-prone. As a reaction, recent approaches seek to automate EA documentation by extracting information out of productive system environments. In our recent work we showed that a particular Enterprise Service Bus (ESB), namely SAP Process Integration, can be used to extract EA-relevant information. As a next step towards automated EA documentation, this paper analyzes the quality of the data stored in SAP Process Integration systems in practice. Survey results of 19 industry partners on 4 continents are presented.
Sebastian Grunow, Florian Matthes, Sascha Roth
Two-Stage Stochastic View Selection for Data-Analysis Queries
Abstract
We consider the problem of selecting an optimal set of views to answer a given collection of queries at the present time (stage 1) as well as several collections of queries in the future (stage 2), with a given probability of occurrence associated with each collection, so as to minimize the expected value of the corresponding query response time while keeping the total size of the views within a given limit. We formulate this problem as a two-stage stochastic programming problem. We show that this model is equivalent to an integer programming (IP) model that can be solved via various commercial IP solvers. We also study the relationship between the queries and the views in this context and use this relationship to reduce the size of the corresponding IP model, and hence increase the scalability of our proposed approach.
Rong Huang, Rada Chirkova, Yahya Fathi
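
A generic sketch of the two-stage form described above (the paper's actual IP model is more detailed; the symbols here are assumptions): choose materialized views x to minimize the current cost plus the expected future cost, under a storage budget:

```latex
\min_{x \in \{0,1\}^{|V|}} \; C_{1}(x) + \sum_{s \in S} p_{s}\, C_{2}^{s}(x)
\quad \text{subject to} \quad \sum_{v \in V} \mathrm{size}(v)\, x_{v} \le B
```

where x_v = 1 iff view v is materialized, C1 is the response time of the present query collection given x, C2^s that of the future collection in scenario s with probability p_s, and B is the storage limit.
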
Choosing Values for Text Fields in Web Forms
Abstract
Since the only way to gain access to Hidden Web data is through form submission, one of the challenges is how to fill Web forms automatically. In this paper, we propose algorithms which address this challenge. We describe an efficient method to select good values for text fields and a technique which minimizes the number of form submissions while maximizing the number of rows retrieved from the underlying database. Experiments using real Web forms show the advantages of our proposed approaches.
Gustavo Zanini Kantorski, Tiago Guimaraes Moraes, Viviane Pereira Moreira, Carlos Alberto Heuser
A Relative Feature Selection Algorithm for Graph Classification
Abstract
Graph classification is one of the most important research fields in data mining nowadays, and many algorithms have been proposed to address this issue. In practice, labeling large or even medium-size graphs is a hard task that requires experts. The biggest challenge in graph classification is extracting a set of proper features from graphs; since graphs are represented by a complex data structure, this issue has been studied for a long time. Previous methods focused on extracting features from a certain class in a dataset. In this paper we propose a new feature selection method that extracts features from each graph rather than from a certain class in the dataset. We extract only frequent subgraphs as features; these subgraphs are chosen according to their number of occurrences in a graph. Moreover, we propose a new formula that calculates the minimum number of occurrences required for a subgraph to be considered frequent. We experimented on five real datasets and reached 7-17% higher accuracy than previously proposed methods.
Yaser Keneshloo, Sasan Yazdani
Generic Conceptual Framework for Handling Schema Diversity during Source Integration
Semantic Database Case Study
Abstract
DataBase Integration Systems (DIS) aim at providing a unified view of data stored in different heterogeneous local sources through a global schema. The schemas of the global and local sources are represented by a priori known logical representations. This makes the potential deployment model of the DIS, as for Data Warehouse (DW) systems, rigid and inflexible. To overcome these limitations, we claim that the integration process should be performed completely at the conceptual level, independently of any implementation constraint. After studying existing database (DB) integration systems, we propose in this paper a generic conceptual framework for DB schema integration. This framework is generic in that it subsumes the most important DB integration systems studied. It is defined based on the description logic formalism. We then instantiate it through a case study considering a set of Oracle semantic DBs, i.e. DBs storing their own conceptual model, generated using the Lehigh University BenchMark (LUBM). These DBs participate in the construction of a semantic DW.
Selma Khouri, Ladjel Bellatreche, Nabila Berkani
Adaptation of Asymptotic Trust Algorithm
Reputation on Internet Forums
Abstract
The Internet is a great medium that allows us to share knowledge with people from all over the world. On the other hand, information from the Internet is unreliable, because one cannot tell whether it was provided by an expert or a novice. In many cases this does not matter, because everybody has the right to his/her opinion about, for example, a video clip. However, there are situations when the credibility of the information is vitally important: often one wants to find a solution to an important problem, and therefore needs some method to ignore jokes, spam and wrong answers by inexperienced users, and to consider only credible answers. We have designed the Asymptotic Trust Algorithm (ATA) to manage users' reputation in customer-to-customer environments such as auction sites. In this paper we present an adaptation of ATA which allows the reputation of Internet forum users to be managed. We believe that our method will be useful for every Internet forum that cares about reliability.
Konrad Leszczyński, Maciej Zakrzewicz
PropScale: An Update Propagator for Joint Scalable Storage
Abstract
In the era of Web 2.0 and the apparent dawn of Web 3.0, web pages are dynamic and personalized. As a result, the load on web servers rapidly increases; moreover, upcoming load boosts are impossible to predict. The term success-tolerant architectures, although deceptively funny, has been coined: a number of web services have actually failed because of their initial success. In order to achieve success-tolerance, server architectures must be scalable. Nowadays almost all components of a system can be multiplied; the only exception is the storage constituent. The usual solution with one strong relational database is unsatisfactory, so designers introduce additional (NO)SQL storage facilities. From this point on, one has a number of separate data sources that can get inconsistent with each other, and special software must be developed to synchronize them. This means more bugs to fix, more code to maintain and more money to spend. In this paper we present a new technique for introducing a number of non-homogeneous storage units into a system. The solution consists of an algorithm that propagates updates among the disparate (NO)SQL storages built into a system.
Paweł Leszczyński, Krzysztof Stencel
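
The abstract stays at the architecture level; as a rough illustration of what an update propagator does (a hypothetical sketch, not PropScale's algorithm), a single write path can fan each logical update out to adapters for the disparate stores and keep a log for catch-up:

```python
class UpdatePropagator:
    """Fans a single logical update out to heterogeneous stores,
    keeping a log so a store that was down can catch up later."""

    def __init__(self):
        self.stores = []   # adapters for the (NO)SQL storage units
        self.log = []      # ordered update log

    def register(self, store):
        self.stores.append(store)

    def apply(self, update):
        self.log.append(update)
        for store in self.stores:
            store.apply(update)   # each adapter translates to its own dialect

class DictStore:
    """Toy adapter: a key-value store standing in for one storage unit."""
    def __init__(self):
        self.data = {}
    def apply(self, update):
        key, value = update
        self.data[key] = value

propagator = UpdatePropagator()
sql_like, nosql_like = DictStore(), DictStore()
propagator.register(sql_like)
propagator.register(nosql_like)
propagator.apply(("user:1", {"name": "Ada"}))
assert sql_like.data == nosql_like.data   # both stores saw the same update
```
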
An Indexing Structure for Dynamic Multidimensional Data in Vector Space
Abstract
The multidimensional kNN (k nearest neighbors) query problem is relevant to a large variety of database applications, including information retrieval, natural language processing, and data mining. To solve it efficiently, the database needs an indexing structure that supports this kind of search. However, exact solutions are hardly feasible in multidimensional space. In this paper, a novel indexing technique for the approximate solution of the kNN problem is described and analyzed. The construction of the indexing tree is based on clustering. The indexing structure is implemented on top of a high-performance industrial DBMS.
Elena Mikhaylova, Boris Novikov, Anton Volokhov
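
A minimal Python sketch of the underlying idea, approximate kNN over a clustering-based index (a flat k-means index rather than the paper's indexing tree; all parameters illustrative): search only the few clusters whose centroids are closest to the query:

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_index(points, k_clusters, iters=10):
    """Plain k-means clustering as a one-level 'index'."""
    centroids = random.sample(points, k_clusters)
    for _ in range(iters):
        buckets = [[] for _ in centroids]
        for p in points:
            buckets[min(range(k_clusters),
                        key=lambda i: dist(p, centroids[i]))].append(p)
        centroids = [
            tuple(sum(c) / len(b) for c in zip(*b)) if b else centroids[i]
            for i, b in enumerate(buckets)   # keep old centroid if bucket empty
        ]
    return centroids, buckets

def knn_approx(query, centroids, buckets, k, probes=2):
    """Exact search restricted to the 'probes' nearest clusters."""
    nearest = sorted(range(len(centroids)),
                     key=lambda i: dist(query, centroids[i]))[:probes]
    candidates = [p for i in nearest for p in buckets[i]]
    return sorted(candidates, key=lambda p: dist(query, p))[:k]

random.seed(0)
points = [(random.random(), random.random()) for _ in range(1000)]
centroids, buckets = build_index(points, k_clusters=16)
print(knn_approx((0.5, 0.5), centroids, buckets, k=3))
```

Probing more clusters trades speed for accuracy, which is the essential approximate-kNN knob.
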
Social Network Analysis on Highly Aggregated Data: What Can We Find?
Abstract
Social network analysis techniques have often been used to derive useful knowledge from email and communication networks. However, most previous works considered an ideal scenario in which full raw data were available for analysis. Unfortunately, such data raise privacy issues and are often considered too valuable to be disclosed. In this paper we present the results of a social network analysis of a very large volume of telecommunication data acquired from a mobile phone operator. The data are highly aggregated, with only a limited amount of information about individual connections between users. We show that even with such limited data, social network analysis methods provide valuable insights and can reveal interesting patterns.
Mikołaj Morzy, Kamil Forenc
An Adaptive Navigation Method for Semi-structured Data
Abstract
Navigation adaptation is a solution that supports users during their interaction with a system. Several works in the literature deal with navigation adaptation: they guide the user from one document to another, provide the user with a set of links leading to pertinent documents, or apply suitable adaptive navigation technologies to simple links. In this paper, we propose a method that identifies the best navigation path between semi-structured result documents according to the user's needs, history and device.
Rim Zghal Rebai, Corinne Amel Zayani, Ikram Amous
Relational Schema Summarization: A Context-Oriented Approach
Abstract
Querying a database can be challenging for users unfamiliar with its schema, due mainly to the difficulty of understanding dozens or more of possibly poorly designed, interlinked tables, compounded by outdated or missing documentation (the usability problem). Such users include database developers and sophisticated end users: they may eventually need to acquire detailed knowledge of the schema, and their ability to do so would be greatly improved if they could start with a simplified, easy-to-read schema. Simplified and easy-to-read schemas have been studied within a research direction called database schema summarization [8, 10, 11, 12, 13].
Marcus Sampaio, Jefferson Quesado, Samarony Barros
Semantic Approach to Cluster Validity Notion
Abstract
In our research we formulate new concepts of cluster quality based on a semantic point of view. In existing cluster validity approaches, the quality of a clustering is measured according to the correspondence between the dataset and the cluster structure, or according to properties of the cluster structure; cluster semantics and user interests are not considered. We present a semantic approach to cluster validity and a methodology for its evaluation.
Elena Sivogolovko, Bernhard Thalheim
An Automated Approach of Designing Multiplex PCR Primers for the Amplification of Exons
Abstract
This paper presents a new program to design specific and reliable primers for the amplification of exons in a multiplex PCR. The program is composed of two components: a data acquisition component and a data processing component. The data acquisition component automates the time-consuming steps of preparing data for the primer design algorithms. It provides up-to-date information about selected genes, including coding sequences and locations of SNPs. The data processing component automates the process of grouping candidate primers into an optimal set for the multiplex PCR by checking their specificity. It is an original and unconventional method that combines hierarchical clustering and the separate-and-conquer strategy. The results obtained for various genes and presented in this paper show that the proposed method is an effective way to design an optimal set of primers for the multiplex PCR.
Adam Skowron, Rafal Pokrzywa
Functional Dependencies on Symbol Strings Generated by Extended Context Free Languages
Abstract
In this paper, we first rephrase the notion of regular functional dependency. The original definition was based upon the dual language emerging from a regular grammar; we now define regular functional dependencies on finite state automata (FSA) constructed from regular expressions. We extend the definition of regular functional dependency to extended context-free languages. We define the syntactical form of functional dependencies on the graph of the FSA constructed from the regular expression denoting the right side of the production rules; the semantics of a functional dependency is given on the generated language. Using this model we can also handle extended relations generated by recursive regular expressions. The implication problem for our class of dependencies is decidable by a version of the Chase algorithm specified on the graph of the associated FSA.
Gyula I. Szabó, András Benczúr
Extending HQL with Plain Recursive Facilities
Abstract
The mismatch between relational databases and object-oriented programming languages has been significantly mitigated by the use of object-relational mapping. However, the querying facilities available in such mapping systems are still inferior to the features of a fully-fledged relational DBMS. In our research we aim at enriching object-relational mapping with advanced database concepts; an example of such a concept is recursive querying. In prequel papers we have shown how to extend Hibernate's mapping configurations with comprehensible recursive views that map to SQL common table expressions. In this paper we show how one can extend the Hibernate Query Language (HQL) with plain recursive query facilities based on Oracle's CONNECT BY phrase. Although it has unfortunately not become part of the SQL standard, its properties, like cleanness and guaranteed termination, make it worth exploiting. We propose adding CONNECT BY to HQL. We have implemented a prototype mapping of this phrase to recursive queries in major DBMSs and tested its efficiency. As a result, we have obtained a simple, safe and fast way to pose recursive queries in HQL.
Aneta Szumowska, Marta Burzańska, Piotr Wiśniewski, Krzysztof Stencel
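
To make the CONNECT BY-to-recursive-query mapping concrete (a hand-written illustration on a hypothetical employees table, not the authors' generated code), the two equivalent query texts can be written as follows; translating extended HQL into the latter form is what such a prototype must do for DBMSs that lack CONNECT BY:

```python
# Oracle-style hierarchical query: walk an employee tree from its root.
connect_by = """
SELECT emp_id, name, LEVEL
FROM employees
START WITH manager_id IS NULL
CONNECT BY PRIOR emp_id = manager_id
"""

# Equivalent ANSI SQL:1999 recursive common table expression,
# the form supported by most other major DBMSs.
recursive_cte = """
WITH RECURSIVE tree (emp_id, name, lvl) AS (
    SELECT emp_id, name, 1 FROM employees WHERE manager_id IS NULL
    UNION ALL
    SELECT e.emp_id, e.name, t.lvl + 1
    FROM employees e JOIN tree t ON e.manager_id = t.emp_id
)
SELECT emp_id, name, lvl FROM tree
"""
```
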
Trust in RDF Graphs
Abstract
It is well known that pure Resource Description Framework (RDF) is not suitable for representing trust information. This work aims to overcome this limitation. In our proposal, RDF graphs are extended with trust metrics. The paper describes a mechanism for representing and reasoning with trust-annotated RDF data. We present how, with a metric algebra, such annotations can be used when processing data to derive trust for inferred RDF triples.
Dominik Tomaszuk, Karol Pąk, Henryk Rybiński
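
A toy Python sketch of the annotation idea (the combination operator and the transitive-inference rule are illustrative assumptions, not the paper's metric algebra): attach a trust value in [0, 1] to each triple and give an inferred triple the combined trust of its premises:

```python
# Trust-annotated triples: (subject, predicate, object) -> trust in [0, 1].
graph = {
    ("alice", "knows", "bob"):   0.9,
    ("bob",   "knows", "carol"): 0.6,
}

def infer_transitive(graph, predicate, combine=min):
    """Infer s-p-o from s-p-x and x-p-o; the inferred triple's trust
    is the combination (here: min) of its premises' trust values."""
    inferred = {}
    for (s, p, x), t1 in graph.items():
        if p != predicate:
            continue
        for (x2, p2, o), t2 in graph.items():
            if p2 == predicate and x2 == x and (s, p, o) not in graph:
                inferred[(s, p, o)] = combine(t1, t2)
    return inferred

print(infer_transitive(graph, "knows"))
# {('alice', 'knows', 'carol'): 0.6}
```
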
Inference of XML Integrity Constraints
Abstract
In this paper we expand upon previous efforts to infer schema information from existing XML documents. We focus on the inference of integrity constraints, more specifically ID/IDREF/IDREFS attributes in DTDs. Building on the research by Barbosa and Mendelzon (2003), we introduce a heuristic approach to the problem of finding an optimal ID set. The approach is evaluated and tuned in a wide range of experiments.
Matej Vitásek, Irena Mlýnková
Optimizing the Resource Allocation for Approximate Query Processing
Abstract
Query optimization techniques are a proven tool essential for the high performance of database management systems. However, in the context of data spaces or new querying paradigms, such as similarity-based search, exact query evaluation is neither computationally feasible nor meaningful, and approximate query evaluation is the only reasonable option. In this paper, the problem of resource allocation for the approximate evaluation of complex queries is considered, and an approximate algorithm for optimal resource allocation is presented, providing the best feasible quality of the output result subject to a limited total cost of the query.
Anna Yarygina, Boris Novikov
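
A generic statement of the allocation problem described above (a sketch under assumed notation, not the paper's exact model): split the query's total cost budget B across its n operators so that overall result quality is maximized:

```latex
\max_{c_{1},\dots,c_{n}} \; Q\bigl(q_{1}(c_{1}),\dots,q_{n}(c_{n})\bigr)
\quad \text{subject to} \quad \sum_{i=1}^{n} c_{i} \le B, \; c_{i} \ge 0
```

where q_i(c_i) is the quality operator i attains when given resources c_i, and Q aggregates the operators' qualities into the quality of the final result.
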
Backmatter
Metadata
Title
Advances in Databases and Information Systems
Edited by
Tadeusz Morzy
Theo Härder
Robert Wrembel
Copyright year
2013
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-32741-4
Print ISBN
978-3-642-32740-7
DOI
https://doi.org/10.1007/978-3-642-32741-4