
2018 | Book

Database and Expert Systems Applications

DEXA 2018 International Workshops, BDMICS, BIOKDD, and TIR, Regensburg, Germany, September 3–6, 2018, Proceedings

Edited by: Prof. Mourad Elloumi, Michael Granitzer, Abdelkader Hameurlain, Christin Seifert, Prof. Dr. Benno Stein, A Min Tjoa, Prof. Dr. Roland Wagner

Publisher: Springer International Publishing

Book series: Communications in Computer and Information Science


About this book

This volume constitutes the refereed proceedings of the three workshops held at the 29th International Conference on Database and Expert Systems Applications, DEXA 2018, held in Regensburg, Germany, in September 2018: the Third International Workshop on Big Data Management in Cloud Systems, BDMICS 2018, the 9th International Workshop on Biological Knowledge Discovery from Data, BIOKDD, and the 15th International Workshop on Technologies for Information Retrieval, TIR.
The 25 revised full papers were carefully reviewed and selected from 33 submissions. The papers discuss a range of topics including: parallel data management systems, consistency and privacy in cloud computing, graph queries, web and domain corpora, NLP applications, social media, and personalization.

Table of Contents

Frontmatter

Big Data Management in Cloud Systems (BDMICS)

Frontmatter
A Survey on Parallel Database Systems from a Storage Perspective: Rows Versus Columns
Abstract
Big data requirements have revolutionized database technology, bringing many innovative and revamped DBMSs to process transactional (OLTP) or demanding query workloads (cubes, exploration, pre-processing). Parallel and main-memory processing have become important features to exploit new hardware and cope with data volume. With this landscape in mind, we present a survey comparing modern row and columnar DBMSs, contrasting their ability to write data (storage mechanisms, transaction processing, batch loading, enforcing ACID) and their ability to read data (query processing, physical operators, sequential vs. parallel). We provide a unifying view of alternative storage mechanisms, database algorithms and query optimizations used across diverse DBMSs. We contrast the architecture and processing of a parallel DBMS with those of an HPC system. We cover the full spectrum of subsystems, from storage to query processing. We consider parallel processing and the impact of much larger RAM, which brings back main-memory databases. We then discuss important parallel aspects including speedup, sequential bottlenecks, data redistribution, high-speed networks, main-memory processing and fault tolerance at query processing time. We outline an agenda for future research.
Carlos Ordonez, Ladjel Bellatreche
ThespisDIIP: Distributed Integrity Invariant Preservation
Abstract
Thespis is a distributed database middleware that leverages the Actor model to implement causal consistency over an industry-standard DBMS, whilst abstracting complexities for application developers behind a REST open-protocol interface. ThespisDIIP is an extension that addresses integrity invariant preservation for the class of problems where value changes must satisfy a Linear Arithmetic Inequality constraint. An example is a system enforcing that a transaction is only accepted if there are sufficient funds in a bank account. Our evaluation considers correctness, performance and scalability aspects of ThespisDIIP. We also run empirical experiments using YCSB to show the efficacy of the approach for a variety of workloads and conditions, determining that integrity invariants are preserved in a causally-consistent distributed database, whilst minimising latency in the user’s critical path.
Carl Camilleri, Joseph G. Vella, Vitezslav Nezval
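The linear-inequality invariant class above can be illustrated with a minimal sketch (a plain Python toy, not the ThespisDIIP middleware or its Actor-based protocol): a debit is accepted only if the non-negative balance invariant still holds afterwards.
```python
# Minimal sketch of a Linear Arithmetic Inequality invariant check:
# a debit is accepted only if balance - amount >= 0 still holds.
# This illustrates the constraint class only, not the ThespisDIIP protocol.

class Account:
    def __init__(self, balance: float):
        self.balance = balance

    def try_debit(self, amount: float) -> bool:
        """Accept the transaction only if the invariant balance >= 0 is preserved."""
        if self.balance - amount >= 0:       # the linear inequality invariant
            self.balance -= amount
            return True
        return False                         # reject: invariant would be violated

acct = Account(100.0)
print(acct.try_debit(60.0))  # True, balance now 40.0
print(acct.try_debit(60.0))  # False, would drive the balance negative
```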
Privacy Issues for Cloud Systems
Abstract
In this paper we discuss the issue of privacy in the cloud. We first look at which components take part in a cloud privacy system, starting with influencing factors in hardware production, customer privacy and provider privacy. For cloud privacy it is valuable to handle Privacy as a Service (PaaS) together with its security protocols. Not only is it essential to include trusted third parties; the cloud providers themselves have to be strict with their code of conduct in the cloud. The fast-paced economy has raised ethical issues over the years, and different kinds of cloud types harmonize more or less well with privacy ethics. These topics need to be viewed in the context of a cloud privacy system.
Christopher Horn, Marina Tropmann-Frick
Script Based Migration Toolkit for Cloud Computing Architecture in Building Scalable Investment Platforms
Abstract
The 2008 financial crisis, which created a global financial market meltdown, was mainly due to badly structured mortgage loans with poor or subpar credit quality and to the lenders' lack of proper tools to measure portfolio risks. Even though several problems led to this crisis, we look at it from a Big Data perspective. Had the infrastructure and analytical tools been available to the lenders, they would have found the various early warning signs on these mortgage loans and could have better prepared for the crisis. In the aftermath of the crisis, all the big financial institutions took a fresh look and embarked on building various tools and frameworks to address the Big Data in their portfolios with data-driven analysis. The 3Vs (Velocity, Volume and Variety) of the Big Data in our Mortgage Loan Analysis System challenge our traditional approach to collecting, processing and presenting the individual and aggregated loan-level data in a meaningful format that supports our portfolio managers in decision making. The traditional methods are implemented on a standalone on-premises SQL server. Our framework creates the foundation for migrating from a traditional standalone database architecture (on-premises) to a Cloud Computing environment using a "Script Based Implementation". The methods we present are simple but effective and save resources in terms of hardware, software and ongoing maintenance costs. The Big Data "Capture, Transform, Calculate and Visualize" (CTCV) implementation takes a phased approach rather than a big-bang model. Our implementation helps make Big Data management part of the organizational tool kit. This saves hard dollars and brings us in line with the firm's overall strategic vision of moving to Cloud Computing for Investment Management Services.
Rao Casturi, Rajshekhar Sunderraman
Space-Adaptive and Workload-Aware Replication and Partitioning for Distributed RDF Triple Stores
Abstract
The efficient distributed processing of big RDF graphs typically requires decreasing the communication cost over the network. This requires, on the storage level, both careful partitioning (in order to keep the queried data on the same machine) and a careful data replication strategy (in order to enhance the probability of a query finding the required data locally). Analyzing the collected workload trend can provide a basis for highlighting the more important parts of the data set that are expected to be targeted by future queries. However, the outcome of such analysis is highly affected by the type and diversity of the collected workload and its correlation with the used application. In addition, the replication type and size are limited by the amount of available storage space. Both of these main factors, workload quality and storage space, are very dynamic in practical systems. In this work we present our adaptable partitioning and replication approach for a distributed RDF triple store. The approach enables the storage layer to adapt to the available storage space and to the available quality of workload, aiming to give the best possible performance under these variables.
Ahmed Al-Ghezi, Lena Wiese
Performance Comparison of Three Spark-Based Implementations of Parallel Entity Resolution
Abstract
During the last decade, several big data processing frameworks have emerged that enable users to analyze large-scale data with ease. With the help of those frameworks, users can more easily manage distributed programming, failures and data partitioning issues. Entity Resolution is a typical application that requires big data processing frameworks, since its time complexity increases quadratically with the input data. In recent years Apache Spark has become popular as a big data framework providing a flexible programming model that supports in-memory computation. Spark offers three APIs: RDDs, which give users core low-level data access, and the high-level APIs DataFrame and Dataset, which are part of the Spark SQL library and undergo a process of query optimization. Given their different features, the choice of API can be expected to influence the resulting performance of applications. However, few studies offer experimental measures to characterize the effect of such distinctions. In this paper we evaluate the performance impact of these choices for the specific application of parallel entity resolution under two different scenarios, with the goal of offering practical guidelines for developers.
Xiao Chen, Kirity Rapuru, Gabriel Campero Durand, Eike Schallehn, Gunter Saake
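As a rough illustration of the API choice the paper measures (this is not the authors' benchmark code, and the record layout and column names are hypothetical), the same blocking and candidate-pair step of entity resolution can be written against both the DataFrame and the RDD API.
```python
# Sketch: blocking + candidate-pair generation for entity resolution,
# once with the DataFrame API and once with the RDD API.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("er-sketch").getOrCreate()

records = [(1, "Jon Smith", "10115"), (2, "John Smith", "10115"), (3, "Ann Ray", "80331")]
df = spark.createDataFrame(records, ["id", "name", "zip"])

# DataFrame API: self-join on a blocking key (zip), then keep each pair once.
pairs_df = (df.alias("a").join(df.alias("b"), on="zip")
              .where(F.col("a.id") < F.col("b.id"))
              .select("a.id", "b.id", "a.name", "b.name"))

# RDD API: key by the blocking attribute and join manually.
rdd = df.rdd.map(lambda r: (r["zip"], (r["id"], r["name"])))
pairs_rdd = (rdd.join(rdd)
                .filter(lambda kv: kv[1][0][0] < kv[1][1][0])
                .map(lambda kv: (kv[1][0], kv[1][1])))

print(pairs_df.count(), pairs_rdd.count())
```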
Big Data Analytics: Exploring Graphs with Optimized SQL Queries
Abstract
Nowadays there is an abundance of tools and systems to analyze large graphs. In general, the goal is to summarize the graph and discover interesting patterns hidden in it. On the other hand, there is a lot of data stored in DBMSs that can potentially be analyzed as graphs; external graph data sets can be loaded quickly, and SQL can help prepare graph data sets from raw data. In this paper, we show that SQL queries on a graph stored in relational form as triples can reveal many interesting properties and patterns of the graph in a more flexible and efficient manner than existing systems. We explain how many interesting statistics on the graph can be derived with queries combining joins and aggregations, while linearly recursive queries can summarize interesting patterns including reachability, paths, and connected components. We show experimentally that exploratory queries can be evaluated efficiently based on the input edges and that they perform better than Spark. We also show that skewed-degree vertices, cycles and cliques are the main reasons exploratory queries become slow.
Sikder Tahsin Al-Amin, Carlos Ordonez, Ladjel Bellatreche
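A small sketch of the query style described above, using SQLite from Python as a stand-in for the relational engine (the schema and data are illustrative, not the paper's): the graph is stored as an edge table, degree statistics come from joins and aggregations, and reachability from a linearly recursive query.
```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edge (src INTEGER, dst INTEGER)")
con.executemany("INSERT INTO edge VALUES (?, ?)", [(1, 2), (2, 3), (3, 4), (1, 3)])

# Aggregation: out-degree per vertex.
for row in con.execute("SELECT src, COUNT(*) AS outdeg FROM edge GROUP BY src"):
    print(row)

# Linearly recursive query: all vertices reachable from vertex 1.
reach = con.execute("""
    WITH RECURSIVE reach(v) AS (
        SELECT 1
        UNION
        SELECT e.dst FROM edge e JOIN reach r ON e.src = r.v
    )
    SELECT v FROM reach
""").fetchall()
print(reach)
```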

Biological Knowledge Discovery from Big Data (BIOKDD)

Frontmatter
New Modeling Ideas for the Exact Solution of the Closest String Problem
Abstract
In this paper we consider the exact solution of the closest string problem (CSP). In general, exact algorithms for an NP-hard problem are either branch and bound procedures or dynamic programs. With respect to branch and bound, we give a new Integer Linear Programming formulation, improving over the standard one, and also suggest some combinatorial lower bounds for a possible non-ILP branch and bound approach. Furthermore, we describe the first dynamic programming procedure for the CSP.
Marcello Dalpasso, Giuseppe Lancia
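For context, a sketch of the standard ILP formulation of the CSP as commonly stated in the literature (not the paper's new formulation; notation assumed here): strings s_1,...,s_k of length m over alphabet Σ, binary variables x_{j,c} selecting the character of the sought string at position j, and d bounding the Hamming distance to every input string.
```latex
% Standard ILP formulation of the closest string problem (sketch).
\begin{align*}
\min \quad & d \\
\text{s.t.} \quad & \sum_{c \in \Sigma} x_{j,c} = 1, && j = 1,\dots,m, \\
& \sum_{j=1}^{m} \bigl(1 - x_{j,\, s_i[j]}\bigr) \le d, && i = 1,\dots,k, \\
& x_{j,c} \in \{0,1\}, \quad d \in \mathbb{Z}_{\ge 0}.
\end{align*}
```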
Ensemble Clustering Based Dimensional Reduction
Abstract
A distance metric over a given data space should reflect the precise comparison among objects. The Euclidean distance between data points represented by a large number of features does not capture the actual relationship between those points. Objects in the same cluster, however, often share common attributes even though their geometric distance may be rather large. In this study, we propose a new method that replaces the given data space with a categorical space based on ensemble clustering (EC). The EC space is defined by tracking the membership of the points over multiple runs of clustering algorithms. To assess the suggested method, it was integrated within the framework of the Decision Tree, K Nearest Neighbors, and Random Forest classifiers. The results obtained by applying EC on 10 datasets confirm our hypothesis that embedding the EC space as a distance metric improves the performance and reduces the feature space dramatically.
Loai Abddallah, Malik Yousef
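A minimal sketch of the ensemble-clustering idea described above (the parameters, the choice of k-means and the Hamming-distance k-NN are illustrative assumptions, not necessarily the authors' exact setup): each point is re-encoded by its cluster memberships over several clustering runs, and a classifier then works in that categorical space.
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

def ec_space(X, n_runs=20, k_range=(2, 10), seed=0):
    """Re-encode each point by its cluster label in n_runs clustering runs."""
    rng = np.random.RandomState(seed)
    memberships = []
    for _ in range(n_runs):
        k = rng.randint(*k_range)
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=rng.randint(10**6)).fit_predict(X)
        memberships.append(labels)
    return np.column_stack(memberships)   # one categorical feature per run

Z = ec_space(X)
Z_tr, Z_te, y_tr, y_te = train_test_split(Z, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5, metric="hamming").fit(Z_tr, y_tr)
print("accuracy in EC space:", knn.score(Z_te, y_te))
```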
Detecting Low Back Pain from Clinical Narratives Using Machine Learning Approaches
Abstract
Free-text clinical notes recorded during patients’ visits in the Electronic Medical Record (EMR) system narrate clinical encounters, often using ‘SOAP’ notes (an acronym for subjective, objective, assessment, and plan). The free-text notes represent a wealth of information for discovering insights, particularly in medical conditions such as pain and mental illness, where regular health metrics provide very little knowledge about the patients’ medical situations and reactions to treatments. In this paper, we develop a generic text-mining and decision support framework to diagnose chronic low back pain. The framework utilizes open-source algorithms for anonymization, natural language processing, and machine learning to classify low back pain patterns from unstructured free-text notes in the EMR system as recorded by primary care physicians during patients’ visits. The initial results show high accuracy for the limited labelled data set of thirty-four patients used in this pilot study. We are currently processing a larger data set to test our approach.
Michael Judd, Farhana Zulkernine, Brent Wolfrom, David Barber, Akshay Rajaram
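A generic sketch of the classification step on free-text notes (it omits the anonymization and NLP components of the authors' framework; the note texts and labels are placeholders): a bag-of-words pipeline evaluated with cross-validation.
```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

notes = [
    "patient reports chronic low back pain radiating to left leg",
    "follow-up visit, hypertension well controlled, no complaints",
    "lower back pain for six months, worse with prolonged sitting",
    "routine physical, no musculoskeletal complaints",
]
labels = [1, 0, 1, 0]  # 1 = chronic low back pain pattern (placeholder labels)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1),
                    LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, notes, labels, cv=2)
print("cross-validated accuracy:", scores.mean())
```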
Classifying Big DNA Methylation Data: A Gene-Oriented Approach
Abstract
Thanks to Next Generation Sequencing (NGS) techniques, publicly available genomic data on cancer is growing quickly. Indeed, the largest public cancer database, The Cancer Genome Atlas (TCGA), contains huge amounts of biomedical big data to be analyzed with advanced knowledge extraction methods. In this work, we focus on the NGS experiment of DNA methylation, whose data matrices are composed of hundreds of thousands of features (i.e., methylated sites). We propose an efficient data processing procedure that yields a gene-oriented organization of the data and enables a supervised machine learning analysis with state-of-the-art methods. The procedure divides the original data matrices into several sub-matrices, each one containing the sites located within the same gene. We extract from TCGA the DNA methylation data of three tumor types (i.e., breast, prostate, and thyroid carcinomas) and are able to successfully discriminate tumoral from non-tumoral samples using function-, tree-, and rule-based classifiers. Finally, we select the best-performing genes (matrices) by ranking them according to the accuracy of the classifiers and perform an enrichment analysis on them. Those genes can be further investigated by domain experts to establish their relation to the cancers under study.
Emanuel Weitschek, Fabio Cumbo, Eleonora Cappelli, Giovanni Felici, Paola Bertolazzi
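A compact sketch of the gene-oriented procedure (random placeholder data, an assumed site-to-gene mapping, and a single tree-based classifier stand in for the TCGA matrices and the paper's full method): the site-level matrix is split into per-gene sub-matrices, each is classified, and genes are ranked by cross-validated accuracy.
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n_samples, n_sites = 60, 12
beta = pd.DataFrame(rng.rand(n_samples, n_sites),
                    columns=[f"cg{i:04d}" for i in range(n_sites)])
labels = rng.randint(0, 2, n_samples)                  # tumoral vs. non-tumoral
site_to_gene = {c: f"GENE{i % 3}" for i, c in enumerate(beta.columns)}

# Group methylation sites by gene to build per-gene sub-matrices.
genes = {}
for site, gene in site_to_gene.items():
    genes.setdefault(gene, []).append(site)

gene_accuracy = {}
for gene, sites in genes.items():
    sub = beta[sites]                                   # per-gene sub-matrix
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    gene_accuracy[gene] = cross_val_score(clf, sub, labels, cv=5).mean()

# Rank genes by how well their sites separate the two classes.
for gene, acc in sorted(gene_accuracy.items(), key=lambda kv: -kv[1]):
    print(gene, round(acc, 3))
```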
Classifying Leukemia and Gout Patients with Neural Networks
Abstract
Machine learning is one of the fastest growing fields of recent times and is applied in various areas such as healthcare. In this article, machine learning is used to study patients suffering from either gout or leukemia, but not both, using their uric acid signatures. The study of the uric acid signatures involves the application of supervised machine learning, using an artificial neural network (ANN) with one hidden layer and a sigmoid activation function, to classify patients, and the calculation of the accuracy with k-fold cross validation. We identify the number of nodes in the hidden layer and a value for the weight decay parameter that are optimal in terms of accuracy and ensure good performance.
Guryash Bahra, Lena Wiese
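A sketch of the described model selection (the uric-acid feature matrix is replaced by random placeholder data; the grid values are assumptions): a one-hidden-layer network with sigmoid activation, evaluated with k-fold cross-validation over hidden-layer sizes and weight-decay values.
```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.RandomState(0)
X = rng.rand(120, 20)                 # placeholder for uric acid signatures
y = rng.randint(0, 2, 120)            # 0 = gout, 1 = leukemia (illustrative)

param_grid = {
    "hidden_layer_sizes": [(5,), (10,), (20,), (40,)],
    "alpha": [1e-4, 1e-3, 1e-2, 1e-1],   # L2 penalty, i.e. weight decay
}
search = GridSearchCV(
    MLPClassifier(activation="logistic", max_iter=2000, random_state=0),
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```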
Incremental Wrapper Based Random Forest Gene Subset Selection for Tumor Discernment
Abstract
High-dimensional cancer-related datasets allow researchers to diagnose cancer in a timely fashion and facilitate its effective treatment. Biomedical applications process thousands of features, and it is challenging to extract precise statistics from such high-dimensional datasets. This paper presents Incremental Wrapper based Random Forest Gene Subset Selection for tumor discernment, which works on the principle of incremental wrapper-based feature subset selection with a random forest classification algorithm; the random forest also serves as the performance validator. Incremental wrapper-based feature subset selection is a technique to pick out the best conceivable subset of genes from high-dimensional data at low computational cost. Random forests increase the overall performance as they work well on cancer-related high-dimensional datasets. The efficacy of the random forest classification algorithm as performance validator improves significantly when it works on a selective, discriminative subset of prognostic genes rather than on the raw data. We evaluate the proposed methodology on six publicly available cancer-related high-dimensional datasets and find that it outperforms standard random forests.
Alia Fatima, Usman Qamar, Saad Rehman, Aiman Khan Nazir
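A sketch of the incremental wrapper idea with a random forest as the wrapped classifier (placeholder data; the filter ranking and the acceptance criterion are assumptions, not the paper's exact procedure): genes are ranked by a cheap filter score, then added one at a time and kept only if cross-validated accuracy improves.
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(80, 200)                      # placeholder high-dimensional gene data
y = rng.randint(0, 2, 80)

scores, _ = f_classif(X, y)                # filter ranking of individual genes
ranking = np.argsort(scores)[::-1]

selected, best_acc = [], 0.0
clf = RandomForestClassifier(n_estimators=100, random_state=0)
for gene in ranking[:50]:                   # incremental wrapper phase
    candidate = selected + [gene]
    acc = cross_val_score(clf, X[:, candidate], y, cv=5).mean()
    if acc > best_acc:                      # keep the gene only if it helps
        selected, best_acc = candidate, acc

print(len(selected), "genes selected, accuracy", round(best_acc, 3))
```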
Protein Identification as a Suitable Application for Fast Data Architecture
Abstract
Metaproteomics is a field of biology research that relies on mass spectrometry to characterize the protein complement of microbiological communities. Since only identified data can be analyzed, identification algorithms such as X!Tandem, OMSSA and Mascot are essential in the domain, to get insights into the biological experimental data. However, protein identification software has been developed for proteomics. Metaproteomics, in contrast, involves large biological communities, gigabytes of experimental data per sample, and greater amounts of comparisons, given the mixed culture of species in the protein database. Furthermore, the file-based nature of current protein identification tools makes them ill-suited for future metaproteomics research. In addition, possible medical use cases of metaproteomics require near real-time identification. From the technology perspective, Fast Data seems promising to increase throughput and performance of protein identification in a metaproteomics workflow. In this paper we analyze the core functions of the established protein identification engine X!Tandem and show that streaming Fast Data architectures are suitable for protein identification. Furthermore, we point out the bottlenecks of the current algorithms and how to remove them with our approach.
Roman Zoun, Gabriel Campero Durand, Kay Schallert, Apoorva Patrikar, David Broneske, Wolfram Fenske, Robert Heyer, Dirk Benndorf, Gunter Saake
Mining Geometrical Motifs Co-occurrences in the CMS Dataset
Abstract
Precise and efficient retrieval of structural motifs is a task of great interest in proteomics. Geometrical approaches to motif identification allow the retrieval of unknown motifs in unfamiliar proteins that may be missed by widespread topological algorithms. In particular, the Cross Motif Search (CMS) algorithm analyzes pairs of proteins and retrieves every group of secondary structure elements that is similar between the two proteins. These similarities are candidates to be structural motifs. When extended to large datasets, the exhaustive approach of CMS generates a huge volume of data. Mining the output of CMS means identifying the most significant candidate motifs proposed by the algorithm, in order to determine their biological significance. In the literature, effective data mining on a CMS dataset is an unsolved problem.
In this paper, we propose a heuristic approach based on what we call protein “co-occurrences” to guide data mining on the CMS dataset. Preliminary results show that the proposed implementation is computationally efficient and is able to select only a small subset of significant motifs.
Mirto Musci, Marco Ferretti
Suitable Overlapping Set Visualization Techniques and Their Application to Visualize Biclustering Results on Gene Expression Data
Abstract
Biclustering algorithms applied to the classification of genomic data have two main theoretical differences compared to traditional clustering algorithms. First, they provide bi-dimensionality, grouping both genes and conditions together, since a group of genes can be co-regulated under a given condition but not under others. Second, they allow group overlaps, letting genes contribute to more than one activity. Visualizing biclustering results is a non-trivial process due to these two characteristics. Heatmap-based techniques are considered a standard for visualizing clustering results; they consist of reordering rows and/or columns in order to show clusters as contiguous blocks. For biclustering results, however, this same process cannot be applied without duplicating rows and/or columns. Moreover, a variety of techniques for visualizing sets and their relations has been published in recent years, and some of them can be considered an ideal solution to visualize large sets with a high number of possible relations between them. In this paper, we first review several set-visualization techniques that we consider most suitable to satisfy the two mentioned features of biclustering and then discuss how these techniques can visualize biclustering results.
Haithem Aouabed, Rodrigo Santamaría, Mourad Elloumi

Technologies for Information Retrieval (TIR)

Frontmatter
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity in Web Corpora
Abstract
Web corpora are a cornerstone of modern Language Technology. Corpora built from the web are convenient because their creation is fast and inexpensive. Several studies have been carried out to assess the representativeness of general-purpose web corpora by comparing them to traditional corpora. Less attention has been paid to assessing the representativeness of specialized or domain-specific web corpora. In this paper, we focus on the assessment of the domain representativeness of web corpora and claim that it is possible to assess the degree of domain-specificity, or domainhood, of web corpora. We present a case study where we explore the effectiveness of different measures - namely the Mann-Whitney-Wilcoxon test, Kendall's correlation coefficient, the Kullback–Leibler divergence, log-likelihood and burstiness - to gauge domainhood. Our findings indicate that burstiness is the most suitable measure to single out domain-specific words from a specialized corpus and to allow for the quantification of domainhood.
Marina Santini, Wiktor Strandqvist, Mikael Nyström, Marjan Alirezai, Arne Jönsson
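One common corpus-linguistics definition of burstiness is the average frequency of a term within the documents in which it occurs; whether this matches the authors' exact formulation is an assumption. A minimal sketch: terms that are frequent overall but concentrated in few documents score high, which is the behaviour exploited for domain-specific vocabulary.
```python
from collections import Counter

def burstiness(documents):
    """documents: list of token lists. Returns term -> avg. within-document frequency."""
    total = Counter()       # total occurrences across the corpus
    doc_freq = Counter()    # number of documents containing the term
    for tokens in documents:
        counts = Counter(tokens)
        total.update(counts)
        doc_freq.update(counts.keys())
    return {t: total[t] / doc_freq[t] for t in total}

corpus = [
    "the patient received an epidural injection".split(),
    "the epidural catheter epidural dose was adjusted".split(),
    "the weather report mentioned rain".split(),
]
for term, b in sorted(burstiness(corpus).items(), key=lambda kv: -kv[1])[:5]:
    print(term, round(b, 2))
```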
A Case Study of Closed-Domain Response Suggestion with Limited Training Data
Abstract
We analyze the problem of response suggestion in a closed domain along a real-world scenario of a digital library. We present a text-processing pipeline to generate question-answer pairs from chat transcripts. On this limited amount of training data, we compare retrieval-based, conditioned-generation, and dedicated representation learning approaches for response suggestion. Our results show that retrieval-based methods that strive to find similar, known contexts are preferable over parametric approaches from the conditioned-generation family, when the training data is limited. We, however, identify a specific representation learning approach that is competitive to the retrieval-based approaches despite the training data limitation.
Lukas Galke, Gunnar Gerstenkorn, Ansgar Scherp
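A minimal sketch of a retrieval-based response suggester of the kind found preferable above (the question-answer pairs are invented placeholders, not the digital-library chat data): index past question contexts with TF-IDF and return the stored answer of the most similar context.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

qa_pairs = [
    ("how do I renew a borrowed book", "You can renew it in your library account under 'Loans'."),
    ("where can I find e-journals", "E-journals are listed in the digital catalogue."),
    ("what are the opening hours", "The reading room is open 9am to 8pm on weekdays."),
]
questions = [q for q, _ in qa_pairs]

vectorizer = TfidfVectorizer().fit(questions)
index = vectorizer.transform(questions)

def suggest(new_question: str) -> str:
    sims = cosine_similarity(vectorizer.transform([new_question]), index)[0]
    return qa_pairs[sims.argmax()][1]      # answer of the most similar known context

print(suggest("can I renew my books online"))
```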
What to Read Next? Challenges and Preliminary Results in Selecting Representative Documents
Abstract
The vast amount of scientific literature poses a challenge when one is trying to understand a previously unknown topic. Selecting a representative subset of documents that covers most of the desired content can solve this challenge by presenting the user with a small subset of documents. We build on existing research on representative subset extraction and apply it in an information retrieval setting. Our document selection process consists of three steps: computation of the document representations, clustering, and selection of documents. We implement and compare two different document representations, two different clustering algorithms, and three different selection methods using a coverage and a redundancy metric. We execute our 36 experiments on two datasets from different domains, with 10 sample queries each. The results show that there is no clear favorite and that we need to ask whether coverage and redundancy are sufficient for evaluating representative subsets.
Tilman Beck, Falk Böschen, Ansgar Scherp
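One concrete instantiation of the three-step process described above (TF-IDF representations, k-means clustering, nearest-to-centroid selection; the documents are placeholders and this is only one of the compared alternatives):
```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import euclidean_distances

docs = [
    "neural networks for image classification",
    "convolutional networks and image recognition benchmarks",
    "query expansion in information retrieval",
    "relevance feedback for retrieval systems",
]

X = TfidfVectorizer().fit_transform(docs)                       # step 1: representation
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)     # step 2: clustering

representatives = []
for c in range(km.n_clusters):                                  # step 3: selection
    members = np.where(km.labels_ == c)[0]
    dists = euclidean_distances(X[members],
                                km.cluster_centers_[c].reshape(1, -1)).ravel()
    representatives.append(docs[members[dists.argmin()]])       # closest to centroid

print(representatives)
```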
Text-Based Annotation of Scientific Images Using Wikimedia Categories
Abstract
The reuse of scientific raw data is a key demand of Open Science. In the NOA project we foster the reuse of scientific images by collecting them and uploading them to Wikimedia Commons. In this paper we present a text-based annotation method that proposes Wikipedia categories for open access images. The assigned categories can be used for image retrieval or to upload images to Wikimedia Commons. The annotation basically consists of two phases: extracting salient keywords and mapping these keywords to categories. The results are evaluated on a small set of open access images that were manually annotated.
Frieda Josi, Christian Wartena, Jean Charbonnier
Detecting Link and Landing Page Misalignment in Marketing Emails
Abstract
Links and their landing pages in the World Wide Web are oftentimes flawed or irrelevant. We created a data set of 4266 links within 160 marketing emails whose relevance to their landing pages has been evaluated by crowd workers. We present a study of common misalignments and propose methods for detecting them. An F-score of 0.63 can be achieved by a neural network for cases where the misaligned label is supported by a majority of the 5 crowd worker votes.
Nedim Lipka, Tak Yeon Lee, Eunyee Koh
Toward Validation of Textual Information Retrieval Techniques for Software Weaknesses
Abstract
This paper presents a preliminary validation of common textual information retrieval techniques for mapping unstructured software vulnerability information to distinct software weaknesses. The validation is carried out with a dataset compiled from four software repositories tracked in the Snyk vulnerability database. According to the results, the information retrieval techniques used perform unsatisfactorily compared to regular expression searches. Although the results vary from one repository to another, the preliminary validation presented indicates that explicit referencing of vulnerability and weakness identifiers is preferable for concrete vulnerability tracking. Such referencing allows the use of keyword-based searches, which currently seem to yield more consistent results than information retrieval techniques. Further validation work is required, however, to improve the precision of the techniques.
Jukka Ruohonen, Ville Leppänen
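A sketch of the regular-expression baseline referred to above (the example texts are invented): explicit CWE and CVE identifiers are extracted from unstructured vulnerability descriptions with simple patterns.
```python
import re

CWE = re.compile(r"CWE-\d+")            # weakness identifiers
CVE = re.compile(r"CVE-\d{4}-\d{4,}")   # vulnerability identifiers

descriptions = [
    "Prototype pollution (CWE-1321) reported as CVE-2019-10744 in lodash.",
    "Arbitrary code execution, no weakness identifier given.",
]
for text in descriptions:
    print(CWE.findall(text), CVE.findall(text))
```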
Investigating the Effect of Attributes on User Trust in Social Media
Abstract
One main challenge in social media is to identify trustworthy information. If we cannot recognize information as trustworthy, that information may become useless or be lost. Conversely, we could consume wrong or fake information - with major consequences. How does a user handle the information provided before consuming it? Are the comments on a post, the author, or the votes essential for taking such a decision? Are these attributes considered together, and which attribute is more important? To answer these questions, we developed a trust model to support knowledge sharing of user content in social media. This trust model is based on the dimensions of stability, quality, and credibility. Each dimension contains metrics (user role, user IQ, votes, etc.) that are important to the user based on data analysis. In this paper we present an evaluation of the proposed trust model using conjoint analysis (CA) as an evaluation method. The results, obtained from 348 responses, validate the trust model. A trust degree translator interprets the content as very trusted, trusted, untrusted, or very untrusted based on the calculated trust value. Furthermore, the results show a different importance for each dimension: stability 24%, credibility 35% and quality 41%.
Jamal Al Qundus, Adrian Paschke
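Only the dimension weights (24%, 35%, 41%) come from the abstract; the metric scores and the degree thresholds below are assumptions. A sketch of how such weights could combine into a single trust value that a trust degree translator then discretizes:
```python
# Weighted combination of the three trust dimensions (weights from the abstract;
# the per-dimension scores and thresholds are illustrative assumptions).
WEIGHTS = {"stability": 0.24, "credibility": 0.35, "quality": 0.41}

def trust_value(scores: dict) -> float:
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

def trust_degree(value: float) -> str:
    if value >= 0.75: return "very trusted"
    if value >= 0.50: return "trusted"
    if value >= 0.25: return "untrusted"
    return "very untrusted"

post = {"stability": 0.6, "credibility": 0.8, "quality": 0.7}
v = trust_value(post)
print(round(v, 2), trust_degree(v))
```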
Analysing Author Self-citations in Computer Science Publications
Abstract
In scientific papers, citations refer to relevant previous work in order to underline the current line of argumentation, compare to other work and/or avoid repetition in writing. Self-citations, i.e. authors citing their own previous work, might have the same motivation but have also gained negative attention with respect to unjustified improvement of scientific performance indicators. Previous studies on self-citations do not provide a detailed analysis for the domain of computer science. In this work, we analyse the prevalence of self-citations in DBLP, a digital library for computer science. We find that approximately 10% of all citations are self-citations, while the rates vary with the year after publication, the position of the author in the author list, and the gender of the lead author. Further, we find that C-ranked venues have the highest incoming self-citation rate, while the outgoing rate is stable across all ranks.
Tobias Milz, Christin Seifert
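A sketch of how a self-citation rate can be computed from bibliographic records (toy entries standing in for DBLP data): a citation counts as a self-citation if the citing and cited author sets overlap.
```python
papers = {
    "p1": {"authors": {"A. Author", "B. Coauthor"}, "cites": ["p2", "p3"]},
    "p2": {"authors": {"A. Author"},                "cites": ["p3"]},
    "p3": {"authors": {"C. Someone"},               "cites": []},
}

total, self_cites = 0, 0
for paper in papers.values():
    for cited_id in paper["cites"]:
        total += 1
        if paper["authors"] & papers[cited_id]["authors"]:  # shared author
            self_cites += 1

print(f"self-citation rate: {self_cites / total:.0%}")  # 1 of 3 citations here
```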
A Semantic-Based Personalized Information Retrieval Approach Using a Geo-Social User Profile
Abstract
A user’s search history contains latent semantics that can be used to improve the representation of the user’s interests. In this paper we present a personalized information retrieval approach that highlights and uses this latent semantics. This semantics, which we call personal semantics, is expressed by the different co-occurrence relationships between relevant terms, according to the user’s different search contexts. The personal semantics is integrated into a geo-social user profile to make it more representative of the user’s interests. Then, to improve the relevance of the search results, the user profile is used to reformulate the user query, broadening the search scope without going further than the user needs.
Tahar Rafa, Samir Kechid
Backmatter
Metadata
Title
Database and Expert Systems Applications
Edited by
Prof. Mourad Elloumi
Michael Granitzer
Abdelkader Hameurlain
Christin Seifert
Prof. Dr. Benno Stein
A Min Tjoa
Prof. Dr. Roland Wagner
Copyright year
2018
Electronic ISBN
978-3-319-99133-7
Print ISBN
978-3-319-99132-0
DOI
https://doi.org/10.1007/978-3-319-99133-7