
2019 | Book

Database and Expert Systems Applications

30th International Conference, DEXA 2019, Linz, Austria, August 26–29, 2019, Proceedings, Part I

Edited by: Prof. Dr. Sven Hartmann, Josef Küng, Sharma Chakravarthy, Prof. Dr. Gabriele Anderst-Kotsis, A Min Tjoa, Ismail Khalil

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

This two volume set of LNCS 11706 and LNCS 11707 constitutes the refereed proceedings of the 30th International Conference on Database and Expert Systems Applications, DEXA 2019, held in Linz, Austria, in August 2019.

The 32 full papers presented together with 34 short papers were carefully reviewed and selected from 157 submissions. The papers are organized in the following topical sections:

Part I: Big data management and analytics; data structures and data management; management and processing of knowledge; authenticity, privacy, security and trust; consistency, integrity, quality of data; decision support systems; data mining and warehousing.

Part II: Distributed, parallel, P2P, grid and cloud databases; information retrieval; Semantic Web and ontologies; information processing; temporal, spatial, and high dimensional databases; knowledge discovery; web services.

Table of Contents

Frontmatter

Big Data Management and Analytics

Frontmatter
Optimization of Row Pattern Matching over Sequence Data in Spark SQL

Due to advances in information and communications technology and sensor technology, a large quantity of sequence data (time series data, log data, etc.) is generated and processed every day. Row pattern matching for sequence data stored in relational databases was standardized as SQL/RPR in 2016. Today, in addition to relational databases, there are many frameworks for processing large amounts of data in parallel and distributed computing environments, including MapReduce and Spark. Hive and Spark SQL enable us to code data analysis processes in SQL-like query languages, and row pattern matching is beneficial there as well. However, the computational cost of row pattern matching is large, and the process needs to be made efficient. In this paper, we propose two optimization methods that reduce the computational cost of row pattern matching. We focus on Spark and present the design and implementation of the proposed methods for Spark SQL. We verify through experiments that our optimization methods substantially reduce the processing time of Spark SQL queries that include row pattern matching.

Kosuke Nakabasami, Hiroyuki Kitagawa, Yuya Nasu
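
As background for readers unfamiliar with SQL/RPR-style row pattern matching (MATCH_RECOGNIZE), the following minimal Python sketch illustrates the semantics of matching a falling-then-rising price pattern over ordered rows. It only illustrates the general technique, not the authors' Spark SQL implementation, and all names are hypothetical.

```python
def match_down_up(rows, key="price"):
    """Yield (start, end) index pairs where `key` strictly falls (DOWN+)
    and then strictly rises (UP+), resuming after each match
    (AFTER MATCH SKIP PAST LAST ROW semantics)."""
    i, n = 0, len(rows)
    while i < n - 1:
        j = i
        while j + 1 < n and rows[j + 1][key] < rows[j][key]:  # DOWN+
            j += 1
        if j == i:                  # no falling run starts here
            i += 1
            continue
        k = j
        while k + 1 < n and rows[k + 1][key] > rows[k][key]:  # UP+
            k += 1
        if k > j:
            yield (i, k)            # one match spans rows i..k
            i = k                   # skip past the match
        else:
            i = j                   # fall without rise: no match

rows = [{"price": p} for p in [10, 8, 7, 9, 12, 11, 11, 9, 10]]
print(list(match_down_up(rows)))    # [(0, 4), (6, 8)]
```
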
Rainfall Estimation from Traffic Cameras

We propose and evaluate a method for estimating rainfall from the images of a network of traffic cameras and rain gauges. The method trains a neural network for each camera under the supervision of the rain gauges and interpolates the results to estimate rainfall at any location. We study and evaluate variants of the method that exploit feature extraction and various interpolation methods. We empirically and comparatively demonstrate the superiority of a hybrid approach and of inverse distance weighting interpolation on an existing comprehensive network of publicly accessible weather stations and traffic cameras.

Remmy Zen, Dewa Made Sri Arsa, Ruixi Zhang, Ngurah Agus Sanjaya ER, Stéphane Bressan
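
The abstract singles out inverse distance weighting (IDW) interpolation. Below is a minimal sketch of plain IDW under its usual definition; the station coordinates and values are hypothetical, and the authors' variant may differ.

```python
import math

def idw(query, stations, power=2.0):
    """Inverse distance weighting: estimate a value at `query` (x, y)
    from (x, y, value) triples in `stations`."""
    num, den = 0.0, 0.0
    for x, y, v in stations:
        d = math.hypot(query[0] - x, query[1] - y)
        if d == 0.0:
            return v                 # exact hit at a station
        w = 1.0 / d ** power
        num += w * v
        den += w
    return num / den

# Rainfall (mm/h) estimated per camera, interpolated to a new location:
cameras = [(0.0, 0.0, 2.5), (1.0, 0.0, 4.0), (0.0, 1.0, 1.0)]
print(idw((0.5, 0.5), cameras))
```
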
Towards Identifying De-anonymisation Risks in Distributed Health Data Silos

Accessing distributed and isolated data repositories, such as medical research and treatment data, in a privacy-preserving manner is a challenging problem. Furthermore, in the context of high-dimensional datasets, adhering to strict privacy legislation can be projected to a W[2]-complete problem whereby all privacy-violating attribute combinations must be identified. While traditional anonymisation algorithms incur high levels of information loss when applied to high-dimensional data, they often do not guarantee privacy, which defeats the purpose of anonymisation. In this paper, we extend our previous work and address these issues by using Bayesian networks to handle data transformation for anonymisation [29]. By computing the conditional probabilities linking attribute pairs for all attribute pair combinations, the privacy exposure risk can be assessed. Attribute pairs linked by a high conditional probability indicate a high risk of de-anonymisation, similar to quasi-identifiers in syntactic anonymisation schemes, and can be separated instead of deleted. Attribute compartmentation removes the risk of privacy exposure, and avoiding deletion results in a significant reduction in information loss. In other words, assimilating the conditional probability of outliers directly in the adjacency matrix in a greedy fashion is efficient and privacy-preserving. Further, we offer deeper evaluation insights into optimising Bayesian networks with a multigrid solver to handle state space explosion.

Nikolai J. Podlesny, Anne V. D. M. Kayem, Christoph Meinel
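
To illustrate the scoring step the abstract describes, flagging attribute pairs linked by high conditional probabilities, here is a toy sketch. It is not the authors' Bayesian network pipeline; the threshold and data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def risky_pairs(records, threshold=0.9):
    """Flag attribute pairs (a, b) where some value of `a` almost
    determines the value of `b`: max over observed (u, v) of P(b=v | a=u)."""
    attrs = list(records[0].keys())
    flagged = []
    for a, b in combinations(attrs, 2):
        joint = Counter((r[a], r[b]) for r in records)
        marg = Counter(r[a] for r in records)
        score = max(joint[(u, v)] / marg[u] for (u, v) in joint)
        if score >= threshold:
            flagged.append((a, b, score))
    return flagged

data = [{"zip": "10115", "city": "Berlin", "age": 34},
        {"zip": "10115", "city": "Berlin", "age": 29},
        {"zip": "4040",  "city": "Linz",   "age": 29}]
# Tiny samples over-flag; real data needs support thresholds as well.
print(risky_pairs(data))
```
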
An Attribute-Based Fine-Grained Access Control Mechanism for HBase

In the current age of big data, the access control mechanism of HBase, a NoSQL big data management system, needs to be improved because of the limitations of Role-Based Access Control (RBAC) in HBase: the coarse-grained access permissions are of little use in many cases, and the elements used for authorization are not comprehensive enough. Attribute-Based Access Control (ABAC) is suitable for the authorization of NoSQL data stores due to its flexibility, but it has not been investigated deeply in HBase. The objective of this paper is to study data access control in HBase and to develop an ABAC-based mechanism for the security of HBase data. In light of the wide-column nature of HBase, an Attribute-Based Fine-Grained Access Control mechanism (AGAC) is proposed, which covers two aspects: users' atomic operations and five granularity levels. When a user needs to access data in HBase storage, the AGAC grants or denies permission by verifying the user's atomic operations and by analyzing the user's attributes according to the access control policies associated with the data granularity level. This access control mechanism is verified on a publicly available email dataset and is shown to effectively improve the access control capability of HBase.

Liangqiang Huang, Yan Zhu, Xin Wang, Faisal Khurshid
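
A minimal sketch of an ABAC-style decision of the kind the abstract describes, checking a user's attributes against a policy tied to an operation and a granularity level. The policy format and all names are hypothetical, not the paper's AGAC implementation.

```python
# Each policy grants one atomic operation on one resource at one
# granularity level, provided the user's attributes match.
POLICIES = [
    {"level": "column_family", "resource": "patients:medical",
     "op": "read", "require": {"role": "doctor", "dept": "cardiology"}},
    {"level": "table", "resource": "patients",
     "op": "read", "require": {"role": "admin"}},
]

def check_access(user_attrs, op, resource, level):
    """Grant access iff some policy for (op, resource, level) is satisfied."""
    for p in POLICIES:
        if (p["op"], p["resource"], p["level"]) == (op, resource, level):
            if all(user_attrs.get(k) == v for k, v in p["require"].items()):
                return True
    return False

print(check_access({"role": "doctor", "dept": "cardiology"},
                   "read", "patients:medical", "column_family"))  # True
print(check_access({"role": "nurse"}, "read", "patients", "table"))  # False
```
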

Data Structures and Data Management

Frontmatter
Lopper: An Efficient Method for Online Log Pattern Mining Based on Hybrid Clustering Tree

Large-scale distributed systems suffer from the problem that system managers cannot discover, locate, and fix anomalies in time when a system malfunctions. System logs are often used for anomaly detection, but manually inspecting them is unfeasible due to the increasing scale and complexity of distributed systems. As a result, various methods for automatically mining log patterns for anomaly detection have been developed. Existing methods for log pattern mining are either time-consuming or inaccurate. To address these problems, we propose Lopper, a hybrid clustering tree for online log pattern mining. Our method accelerates the mining process by clustering raw log data in a one-pass manner and ensures accuracy by merging and combining similar patterns with different kernel functions in each step. We evaluate our method on massive sets of log data generated by different industrial applications. The experimental results show that Lopper achieves an average accuracy of 92.26%, much better than comparative methods, while remaining highly efficient. We also conduct experiments on a system anomaly detection task using the log patterns generated by Lopper; the results show an average F-measure of 91.97%, which further proves the effectiveness of Lopper.

Jiawei Liu, Zhirong Hou, Ying Li
Discord Monitoring for Streaming Time-Series

Many applications generate and analyze time-series data. One of the most important time-series analysis tools is anomaly detection, and discord discovery aims at finding an anomalous subsequence in a time-series. Time-series are essentially dynamic, so monitoring the discord of a streaming time-series is an important problem. This paper addresses this problem and proposes SDM (Streaming Discord Monitoring), an algorithm that efficiently updates the discord of a streaming time-series over a sliding window. We show that SDM is approximation-friendly, i.e., its computational efficiency is improved by monitoring an approximate discord with a theoretical bound. Our experiments on real datasets demonstrate the efficiency of SDM and its approximate version.

Shinya Kato, Daichi Amagata, Shunya Nishio, Takahiro Hara
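
For context, a discord is usually defined as the subsequence whose distance to its nearest non-overlapping neighbor is largest. The brute-force baseline below computes it for one window; SDM's contribution is updating this incrementally as the window slides, which this sketch does not attempt.

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def discord(window, m):
    """Return (start, score) of the length-m subsequence whose distance to
    its nearest non-overlapping neighbor is largest. Brute force; assumes
    the window is longer than 2*m so every subsequence has a neighbor."""
    subs = [window[i:i + m] for i in range(len(window) - m + 1)]
    best = (-1, -1.0)
    for i, s in enumerate(subs):
        nn = min(euclid(s, t) for j, t in enumerate(subs) if abs(i - j) >= m)
        if nn > best[1]:
            best = (i, nn)
    return best

series = [0, 1, 0, 1, 0, 1, 5, 9, 5, 0, 1, 0, 1, 0]
print(discord(series, 3))   # the anomaly around index 6 stands out
```
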
Partially Indexing on Flash Memory

Query indexing is a mature technique in relational databases. Organized as tree-like structures, indexes facilitate data access and speed up query processing. Nevertheless, the construction and modification of indexes is very expensive and can slow down database performance. Traditional approaches cover all records equally, even if some records are queried often and some never. To avoid this problem, partial indexing has been introduced. The core idea is to create indexes adaptively and incrementally as a side product of query processing; in this way, only those records that take part in queries are indexed. With the emergence of modern data storage technologies such as flash memory and phase-change memory, new index types have appeared, invented to overcome the limitations of these technologies. In this paper, we deal with partial indexing on flash memory. We propose a method which reduces the number of write and erase operations on flash memory during index creation. By employing optimization techniques specific to flash memory, the query response time is halved in comparison to traditional methods. As far as we know, this is the first approach that considers partial indexing at the physical data storage level; thus, the paper may initiate a new research direction.

Wojciech Macyna, Michal Kukowski
HGraph: A Connected-Partition Approach to Proximity Graphs for Similarity Search

Similarity search is a common approach to supporting new applications that deal with complex data (e.g., images, videos, georeferenced data, etc.). As a consequence, appropriate indexing structures to support this task have been proposed in the literature. Recently, graph-based methods have been shown to be very efficient for approximate similarity search. However, some of the main types of graphs used still suffer from two drawbacks: (i) slow construction and (ii) inaccurate retrieval. To reduce these drawbacks, we propose the HGraph method. HGraph is a divide-and-conquer method for building graphs for similarity search that recursively partitions the input dataset and connects vertices across partitions at different levels. The method can be used with different types of graphs proposed in the literature to speed up graph construction as well as to increase the quality of approximate search results through long-range edges connecting pivots of different partitions. We present experimental results on real datasets showing that HGraph applied to the k-NNG graph decreased construction time while increasing approximate search recall compared to the k-NNG alone. When applied to the NSW graph, HGraph also increased query recall, although at a higher computational cost. An analysis of different combinations of the tested methods demonstrated that HGraph's query times for a given recall rate were always among the best across different setups.

Larissa Capobianco Shimomura, Daniel S. Kaster
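
As a reference point, the k-NNG the paper builds on connects each vertex to its k nearest neighbors. A brute-force construction sketch follows; HGraph additionally partitions the dataset and adds long-range edges between partition pivots, which is not shown here.

```python
import math
from heapq import nsmallest

def knn_graph(points, k):
    """Brute-force k-NN graph: map each vertex index to the indices of
    its k nearest neighbors (O(n^2) distance computations)."""
    graph = {}
    for i, p in enumerate(points):
        graph[i] = nsmallest(
            k,
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
    return graph

pts = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]
print(knn_graph(pts, 2))
```
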

Management and Processing of Knowledge

Frontmatter
Statistical Processing of Stopwords on SNS

For the purpose of text classification or information retrieval, texts are preprocessed by, for example, stemming and stopword removal. Almost all such techniques are useful only for well-formed text such as textbooks and news articles, but this does not hold for social network services (SNS) or other Internet text. In this investigation, we propose how to extract stopwords in the context of social network services. To do so, we first discuss what stopwords mean and how they differ from conventional ones, and we propose statistical filters, TFIG and TFCHI, to identify them. We examine categorical estimation to extract characteristic values, putting our attention on the Kullback-Leibler divergence (KLD) over temporal sequences of SNS data. Moreover, we apply several preprocessing steps to manage unknown words and to improve morphological analysis.

Yuta Nezu, Takao Miura
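
One way to read the KLD idea in the abstract: a word whose occurrences are spread evenly across time periods behaves like a stopword, and its KLD to the uniform distribution is near zero. A toy sketch under that interpretation (not the paper's TFIG/TFCHI filters):

```python
import math
from collections import Counter

def kld_to_uniform(counts):
    """D(p || uniform), where p is a word's distribution over time periods;
    near 0 means the word is spread evenly -- stopword-like behaviour."""
    total, n = sum(counts), len(counts)
    p = [c / total for c in counts]
    return sum(pi * math.log(pi * n) for pi in p if pi > 0)

periods = [["the cat sat on the mat"], ["the dog ate the cake"]]
vocab = Counter(w for per in periods for doc in per for w in doc.split())
for w in vocab:
    counts = [sum(doc.split().count(w) for doc in per) for per in periods]
    print(w, round(kld_to_uniform(counts), 3))
# "the" appears evenly in both periods (KLD ~ 0); content words spike.
```
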
Multiple Choice Question Answering in the Legal Domain Using Reinforced Co-occurrence

Nowadays, the volume of available legal information is continuously growing. As a result, browsing and querying this huge legal corpus in search of specific information is a tedious task, exacerbated by the fact that data presentation does not usually meet the needs of professionals in the sector. To satisfy these ever-increasing needs, we have designed an adaptive and intelligent solution for automatically answering questions on legal content, based on the computation of reinforced co-occurrence, i.e., a very demanding type of co-occurrence that requires large volumes of information but guarantees good results. The solution builds on pattern-based methods that have already been successfully applied in information extraction research. An empirical evaluation over a dataset of legal questions indicates that this solution is promising.

Jorge Martinez-Gil, Bernhard Freudenthaler, A Min Tjoa
A Probabilistic Algorithm to Predict Missing Facts from Knowledge Graphs

A knowledge graph, as the name says, represents knowledge using a directed graph structure (nodes and edges). However, such graphs are often incomplete or contain a considerable number of wrong facts. This work presents ProA: a probabilistic algorithm to predict missing facts in knowledge graphs based on the probability distribution over paths between entities. Compared to current state-of-the-art approaches, ProA has the following advantages: simplicity, as it considers only the topological structure of a knowledge graph; good performance, as it does not require any complex calculations; and readiness, as it has no requirement other than the graph itself.

André Gonzaga, Mirella Moro, Mário S. Alvim
Semantic Oppositeness Embedding Using an Autoencoder-Based Learning Model

Semantic oppositeness is the natural counterpart of the popular natural language processing concept of semantic similarity. Much like semantic similarity measures the degree to which two concepts are similar, semantic oppositeness yields the degree to which two concepts oppose each other. This complementary nature has led most applications and studies to incorrectly assume semantic oppositeness to be the inverse of semantic similarity. In other trivializations, "semantic oppositeness" is used interchangeably with "antonymy", which is as inaccurate as replacing semantic similarity with simple synonymy. These erroneous assumptions and over-simplifications exist mainly due to either a lack of information or the computational complexity of calculating semantic oppositeness. The objective of this research is to show that word vector embedding can be extended to incorporate semantic oppositeness, so that an effective mapping of semantic oppositeness can be obtained in a given vector space. In the experiments presented in this paper, our proposed method achieves a training accuracy of 97.91% and a test accuracy of 97.82%, proving the applicability of this method even in potentially highly sensitive applications and dispelling doubts of over-fitting. Further, this work introduces a novel unanchored vector embedding method and a novel inductive transfer learning process.

Nisansa de Silva, Dejing Dou
COMET: A Contextualized Molecule-Based Matching Technique

Context-specific descriptions of entities, expressed in RDF, pose challenges for data-driven tasks such as data integration, and context-aware entity matching is a building block for these tasks. However, existing approaches only consider inter-schema mappings of data sources and are not able to manage several contexts during entity matching. We devise COMET, an entity matching technique that relies on both the knowledge stated in RDF vocabularies and context-based similarity metrics to match contextually equivalent entities. COMET executes a novel 1-1 perfect matching algorithm for matching contextually equivalent entities based on the combined scores of semantic similarity and context similarity. COMET employs the Formal Concept Analysis algorithm to compute the context similarity of RDF entities. We empirically evaluate the performance of COMET on a testbed from DBpedia. The experimental results suggest that COMET is able to accurately match equivalent RDF graphs in a context-dependent manner.

Mayesha Tasnim, Diego Collarana, Damien Graux, Mikhail Galkin, Maria-Esther Vidal

Authenticity, Privacy, Security and Trust

Frontmatter
Differentially Private Non-parametric Machine Learning as a Service

Machine learning algorithms create models from training data for the purpose of estimation, prediction and classification. While releasing a parametric machine learning model requires releasing the parameters of the model, releasing a non-parametric model requires releasing the training dataset along with the parameters. The release of the training dataset creates a risk of breach of privacy. An alternative to releasing the training dataset is to present the non-parametric model as a service. Still, a non-parametric model as a service may leak information about the training dataset. We study how to provide differential privacy guarantees for non-parametric models as a service. We show how to apply perturbation to the model functions of the histogram, kernel density estimator, kernel SVM and Gaussian process regression in order to provide $(\epsilon, \delta)$-differential privacy. We empirically evaluate the trade-off between the privacy guarantee and the error incurred for each of these non-parametric machine learning algorithms on benchmarks and real-world datasets. Our contribution is twofold: we show that functional perturbation is not only pragmatic for releasing machine learning models as a service but also yields higher effectiveness than output perturbation mechanisms for specified privacy parameters, and we present a practical way to perturb the model functions of the histogram, kernel SVM and Gaussian process regression, along with the kernel density estimator, with evaluation on a real-world dataset as well as a selection of benchmarks.

Ashish Dandekar, Debabrota Basu, Stéphane Bressan
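
For background, the output perturbation baseline the paper compares against can be illustrated with the standard Gaussian mechanism on a histogram. The calibration below is the classical one from the differential privacy literature, not the paper's functional perturbation method; data and bins are hypothetical.

```python
import math
import random

def dp_histogram(values, bins, eps, delta):
    """Release a histogram with (eps, delta)-differential privacy via the
    Gaussian mechanism. Adding/removing one record changes one bin by 1,
    so the L2 sensitivity is 1; the classical calibration (valid for
    eps < 1) uses sigma = sqrt(2 ln(1.25/delta)) * sensitivity / eps."""
    counts = [0] * len(bins)
    for v in values:
        for i, (lo, hi) in enumerate(bins):
            if lo <= v < hi:
                counts[i] += 1
                break
    sigma = math.sqrt(2 * math.log(1.25 / delta)) / eps
    return [c + random.gauss(0.0, sigma) for c in counts]

ages = [23, 35, 37, 41, 52, 58, 61, 64, 70]
bins = [(20, 40), (40, 60), (60, 80)]
print(dp_histogram(ages, bins, eps=0.9, delta=1e-5))
```
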
PURE: A Privacy Aware Rule-Based Framework over Knowledge Graphs

Open data initiatives and FAIR data principles have encouraged the publication of large volumes of data encoding knowledge relevant to the advance of science and technology. However, mining this knowledge usually requires processing data collected from sources regulated by diverse access and privacy policies. We address the problem of enforcing data privacy and access regulations (EDPR) and propose PURE, a framework able to solve this problem during query processing. PURE relies on the local-as-view approach for defining the rules that represent the access control policies imposed over a federation of RDF knowledge graphs. Moreover, PURE maps the problem of checking whether a query meets the privacy regulations to the problem of query rewriting using views (QRP); it resorts to state-of-the-art QRP solutions for determining whether a query violates the defined policies. We have evaluated the efficiency of PURE over the Berlin SPARQL Benchmark (BSBM). The observed results suggest that PURE is able to scale up to complex scenarios where a large number of rules represent diverse types of policies.

Marlene Goncalves, Maria-Esther Vidal, Kemele M. Endris
FFT-2PCA: A New Feature Extraction Method for Data-Based Fault Detection

The industrial environment requires constant attention to faults in processes. This concern is of central importance both for worker safety and for process efficiency. Modern process automation systems are capable of producing large amounts of data, on which machine learning algorithms can be trained to detect faults. However, the high complexity and dimensionality of this data degrade the quality metrics of these algorithms. In this work, we introduce a new feature extraction method to improve the quality metrics of data-based fault detection. Our method uses a Fast Fourier Transform (FFT) to extract a temporal signature from the input data; to reduce the feature dimensionality generated by the signature extraction, we apply a sequence of Principal Component Analyses (PCA). The output of the feature extraction then feeds a classification algorithm. We achieve an overall improvement of 17.4% on the F1 metric for the ANN classifier. Also, due to intrinsic FFT characteristics, we verified a meaningful reduction in development time for the data-based fault detection solution.

Matheus Maia de Souza, João Cesar Netto, Renata Galante
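
A minimal sketch of the pipeline as the abstract describes it: FFT magnitude spectra as temporal signatures, followed by PCA (here via SVD). The array shapes and random data are illustrative, and the authors' actual feature layout may differ.

```python
import numpy as np

def fft_pca_features(X, n_components):
    """FFT-then-PCA sketch: take the magnitude spectrum of each sensor
    window as a temporal signature, then project onto the top principal
    components computed via SVD."""
    spectra = np.abs(np.fft.rfft(X, axis=1))      # temporal signatures
    centered = spectra - spectra.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T         # reduced features

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))                    # 100 windows, 64 samples
print(fft_pca_features(X, n_components=8).shape)  # (100, 8)
```
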

Consistency, Integrity, Quality of Data

Frontmatter
A DaQL to Monitor Data Quality in Machine Learning Applications

Machine learning models can only be as good as the data used to train them. Despite this obvious correlation, there is little research on data quality measurement to ensure the reliability and trustworthiness of machine learning models. Especially in industrial settings, where sensors produce large amounts of highly volatile data, a one-time measurement of data quality is not sufficient, since errors in new data should be detected as early as possible. Thus, in this paper, we present DaQL (Data Quality Library), a generally applicable tool to continuously monitor the quality of data and thereby increase the prediction accuracy of machine learning models. We demonstrate and evaluate DaQL within an industrial real-world machine learning application at Siemens.

Lisa Ehrlinger, Verena Haunschmid, Davide Palazzini, Christian Lettner
Automated Detection and Monitoring of Advanced Data Quality Rules

Nowadays, business decisions heavily rely on data in data warehouse systems (DWH), so data quality (DQ) in DWHs is a highly relevant topic. Consequently, sophisticated yet easy-to-use solutions for monitoring and ensuring high data quality are needed. This paper is based on the IQM4HD project, in which a prototype of an automated data quality monitoring system has been designed and implemented. Specifically, we focus on expressing advanced data quality rules, such as checking whether data conforms to a certain time series or whether data deviates significantly in any of the dimensions of a data cube. We show how such data quality rules can be expressed in our domain-specific language (DSL) RADAR, which has been introduced in [10]. Since manual specification of such rules tends to be complex, it is particularly important to support the DQ manager in detecting and creating potential rules by profiling historic data. Thus, we also explain the data profiling component of our prototype and illustrate how advanced rules can be semi-automatically detected and suggested to the DQ manager.

Felix Heine, Carsten Kleiner, Thomas Oelsner
Effect of Imprecise Data Income-Flow Variability on Harvest Stability: A Quantile-Based Approach

Data retrieved from sensors must be of high quality to support crucial decisions and effective strategies. Nowadays, in view of the mass of information generated from these data, there is a real need to handle their quality. This paper proposes new indices for quantifying the variability/stability of a data flow according to a data model that handles data imperfection. To deal with data imprecision, we adopt a quantile-based approach. Our index definitions use parameters; hence, to obtain efficient judgments with this approach, we examine the choice of appropriate parameters and how it affects the judgment of harvest stability.

Zied ben Othmane, Cyril de Runz, Amine Ait Younes, Vincent Mercelot
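
As an illustration of the quantile-based idea, a hypothetical stability index could normalise an interquantile spread by the median. The paper's actual indices also account for data imprecision, which this sketch omits; the parameter choices are arbitrary.

```python
import statistics

def quantile_stability(flow, lo=0.25, hi=0.75):
    """Hypothetical variability index: interquantile spread of a data flow
    normalised by its median. Smaller means a more stable flow."""
    qs = statistics.quantiles(flow, n=100)        # 99 percentile cut points
    q_lo, q_hi = qs[int(lo * 100) - 1], qs[int(hi * 100) - 1]
    med = statistics.median(flow)
    return (q_hi - q_lo) / med if med else float("inf")

stable = [100, 102, 99, 101, 100, 98, 103]
erratic = [100, 45, 160, 20, 190, 80, 140]
print(quantile_stability(stable), quantile_stability(erratic))
```
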

Decision Support Systems

Frontmatter
Fairness-Enhancing Interventions in Stream Classification

The widespread use of automated data-driven decision support systems has raised many concerns regarding the accountability and fairness of the employed models in the absence of human supervision. Existing fairness-aware approaches tackle fairness as a batch learning problem and aim at learning a fair model which can then be applied to future instances of the problem. In many applications, however, data arrives sequentially and its characteristics might evolve with time. In such a setting, it is counter-intuitive to "fix" a (fair) model over the data stream, as changes in the data might incur changes in the underlying model, thereby affecting its fairness. In this work, we propose fairness-enhancing interventions that modify the input data so that the outcome of any stream classifier applied to that data will be fair. Experiments on real and synthetic data show that our approach achieves good predictive performance and low discrimination scores over the course of the stream.

Vasileios Iosifidis, Thi Ngoc Han Tran, Eirini Ntoutsi
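
A common discrimination score in this literature is statistical parity. Below is a sketch of monitoring it over a sliding window of decisions; it illustrates the streaming setting only, not the paper's intervention mechanism, and the window size is arbitrary.

```python
from collections import deque

class ParityMonitor:
    """Track statistical parity over a sliding window of
    (protected_group, positive_decision) pairs:
    P(positive | unprotected) - P(positive | protected).
    A score far from 0 signals discrimination."""
    def __init__(self, window=1000):
        self.buf = deque(maxlen=window)

    def add(self, protected, positive):
        self.buf.append((protected, positive))

    def score(self):
        prot = [pos for p, pos in self.buf if p]
        unprot = [pos for p, pos in self.buf if not p]
        if not prot or not unprot:
            return 0.0
        return sum(unprot) / len(unprot) - sum(prot) / len(prot)

m = ParityMonitor(window=4)
for p, y in [(True, 0), (False, 1), (True, 0), (False, 1)]:
    m.add(p, y)
print(m.score())   # 1.0: unprotected always approved, protected never
```
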
Early Turnover Prediction of New Restaurant Employees from Their Attendance Records and Attributes

It is widely known that the turnover rate of new employees is high. Several studies have been conducted on filtering candidates during the recruitment process to avoid hiring employees that are likely to leave early. However, studies on the prediction of early turnover of new employees, which might enable appropriate interventions, are scarce. In the restaurant industry, which suffers from labor shortages, filtering candidates is unrealistic, and it is important to retain newly hired employees. In this study, we propose a new model, based on recurrent neural networks, that predicts the early turnover of new restaurant employees by using their attendance records and attributes. We have evaluated the effectiveness of the proposed model using anonymized data from a restaurant chain in Japan, and we confirmed that the proposed model performs better than baseline models. Furthermore, our analysis revealed that gender and hiring channel had little influence on early turnover and decreased prediction performance. We believe that these results will help in designing efficient interventions to prevent new restaurant employees from leaving early.

Koya Sato, Mizuki Oka, Kazuhiko Kato
An Efficient Premiumness and Utility-Based Itemset Placement Scheme for Retail Stores

In retail stores, the placement of items on the shelf space significantly impacts the sales of items. In particular, the probability of sale of a given item is typically considerably higher when it is placed in a premium (i.e., highly visible/easily accessible) slot as opposed to a non-premium slot. In this paper, we address the problem of maximizing the retailer's revenue by determining the placement of itemsets in different types of slots of varied premiumness such that each item is placed at least once in some slot. We first propose the notion of premiumness of slots in a given retail store. Then we discuss a framework for efficiently identifying itemsets from a transactional database and placing them by mapping itemsets with different revenue to slots of varied premiumness to maximize retailer revenue. Our performance evaluation on both synthetic and real datasets demonstrates that the proposed scheme improves retailer revenue by up to 45% w.r.t. a recent existing scheme.

Parul Chaudhary, Anirban Mondal, Polepalli Krishna Reddy
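
The core mapping the abstract describes, higher-revenue itemsets to more premium slots, can be sketched greedily as below. The paper's scheme additionally sizes itemsets to slots and guarantees every item is placed at least once, which this toy version ignores; all data is hypothetical.

```python
def place_itemsets(itemsets, slots):
    """Greedy sketch: pair the highest-revenue itemsets with the most
    premium slots. `itemsets` is a list of (name, revenue); `slots` is
    a list of (slot_id, premiumness)."""
    by_revenue = sorted(itemsets, key=lambda x: x[1], reverse=True)
    by_premium = sorted(slots, key=lambda x: x[1], reverse=True)
    return [(slot, item)
            for (slot, _), (item, _) in zip(by_premium, by_revenue)]

itemsets = [("bread+butter", 8.0), ("soda", 3.5), ("caviar", 20.0)]
slots = [("aisle-end", 0.9), ("eye-level", 0.7), ("bottom-shelf", 0.2)]
print(place_itemsets(itemsets, slots))
# caviar -> aisle-end, bread+butter -> eye-level, soda -> bottom-shelf
```
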
Data Lakes: Trends and Perspectives

As a relatively new concept, the data lake has neither a standard definition nor an acknowledged architecture. Thus, we study the existing work and propose a complete definition and a generic, extensible architecture of the data lake. Moreover, we introduce three future research axes in connection with our healthcare Information Technology (IT) activities. They are related to (i) metadata management, which covers intra- and inter-metadata, (ii) a unified ecosystem for companies' data warehouses and data lakes, and (iii) data lake governance.

Franck Ravat, Yan Zhao
An Efficient Greedy Algorithm for Sequence Recommendation

Recommending a sequence of items that maximizes some objective function arises in many real-world applications. In this paper, we consider a utility function over sequences of items where sequential dependencies between items are modeled using a directed graph. We propose EdGe, an efficient greedy algorithm for this problem and we demonstrate its effectiveness on both synthetic and real datasets. We show that EdGe achieves comparable recommendation precision to the state-of-the-art related work OMEGA, and in considerably less time. This work opens several new directions that we discuss at the end of the paper.

Idir Benouaret, Sihem Amer-Yahia, Senjuti Basu Roy
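
A minimal greedy sketch in the spirit of the abstract: grow the recommended sequence by repeatedly following the highest-utility unvisited successor in the item graph. The utility model and tie-breaking here are hypothetical simplifications, not EdGe itself.

```python
def greedy_sequence(graph, utility, start, length):
    """Grow a sequence from `start` by appending, at each step, the
    unvisited successor with the highest utility. `graph` maps an item
    to its successors; `utility` maps an item to its score."""
    seq, current = [start], start
    while len(seq) < length:
        candidates = [n for n in graph.get(current, []) if n not in seq]
        if not candidates:
            break
        current = max(candidates, key=utility.get)
        seq.append(current)
    return seq

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
utility = {"a": 1.0, "b": 0.4, "c": 0.9, "d": 0.7}
print(greedy_sequence(graph, utility, "a", 4))   # ['a', 'c', 'd']
```
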
Discovering Diverse Popular Paths Using Transactional Modeling and Pattern Mining

While the problems of finding the shortest path and k-shortest paths have been extensively researched, the research community has been shifting its focus towards discovering and identifying paths based on user preferences. Since users naturally follow some of the paths more than other paths, the popularity of a given path often reflects such user preferences. Moreover, users typically prefer diverse paths over similar paths for gaining flexibility in path selection. Given a set of user traversals in a road network and a set of paths between a given source and destination pair, we propose a scheme based on transactional modeling and pattern mining for performing top-k ranking of these paths based on both path popularity and path diversity. Our performance evaluation with a real dataset demonstrates the effectiveness of the proposed scheme.

P. Revanth Rathan, P. Krishna Reddy, Anirban Mondal

Data Mining and Warehousing

Frontmatter
Representative Sample Extraction from Web Data Streams

Smart or digital city infrastructures facilitate both decision support and strategic planning with applications such as government services, healthcare, transport and traffic management. Generally, each service generates multiple data streams using different data models and structures. Thus, any form of analysis requires some form of extract-transform-load process normally associated with data warehousing to ensure proper cleaning and integration of heterogeneous datasets. In addition, data produced by these systems may be generated at a rate which cannot be captured completely using standard computing resources. In this paper, we present an ETL system for transport data coupled with a smart data acquisition methodology to extract a subset of data suitable for analysis.

Michael Scriney, Congcong Xing, Andrew McCarren, Mark Roantree
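
When a stream arrives faster than it can be captured completely, one classic way to keep a representative fixed-size sample is reservoir sampling (Algorithm R), sketched below. The paper's smart data acquisition methodology is its own approach; this only illustrates the underlying problem, and the stream contents are hypothetical.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown
    length: item i (1-based) replaces a random slot with probability k/i."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)
        else:
            j = random.randrange(i)      # uniform in [0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

readings = ({"sensor": s % 7, "value": s * 0.1} for s in range(100000))
print(reservoir_sample(readings, k=5))
```
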
LogLInc: LoG Queries of Linked Open Data Investigator for Cube Design

By avoiding the "data not invented here" (NIH) syndrome, i.e., a mindset of focusing solely on data created inside the walls of a business (https://urlz.fr/9Yo9), companies have realized the benefit of including external sources in their data cubes. In this context, Linked Open Data (LOD) is a promising external source that may contain valuable data and query logs materializing the exploration of data by end users. Paradoxically, the dataset of this external source is structured whereas the logs are "ugly"; if they are turned into rich structured data, they can contribute to building valuable data cubes. In this paper, we claim that the NIH syndrome must also be considered for query logs. As a consequence, we propose an approach that investigates the particularities of SPARQL query logs performed on the LOD and augmented by the LOD, to discover multidimensional patterns when leveraging and enriching a data cube. To show the effectiveness of our approach, different scenarios are proposed and evaluated using DBpedia.

Selma Khouri, Dihia Lanasri, Roaya Saidoune, Kamila Boudoukha, Ladjel Bellatreche
Handling the Information Backlog for Data Warehouse Development

In both the bus-of-data-marts and agile data warehouse development approaches, the next product part to be developed must be selected. Whereas this issue is not addressed in the former, agile development approaches leave selection from a backlog to the product owner. We provide an approach to backlog selection that is rooted in the decision-making process. Our task is to first select a decision from a backlog of decisions and then select from the backlog of information relevant to it. Three yardsticks are considered for decision selection: business importance, decision structure, and a preference for complete decisions over partial ones. The backlog of information for a selected decision may serve the Intelligence or Choice phases of the decision-making process.

Naveen Prakash, Deepika Prakash
Ontario: Federated Query Processing Against a Semantic Data Lake

Data lakes enable flexible knowledge discovery and reduce the overhead of materialized data integration. Albeit effective for data storage, query execution over data lakes may be expensive, demanding novel techniques to generate plans able to exploit the main characteristics of data lakes. We devise Ontario, a federated query processing approach tailored for large-scale heterogeneous data. Ontario provides efficient and effective query processing over a federation of heterogeneous data sources in a data lake. Ontario resorts to source descriptions named RDF Molecule Templates, i.e., abstract descriptions of the properties of the entities in a unified schema and their implementation in a data lake. We empirically evaluate the effectiveness of the Ontario optimization techniques over state-of-the-art benchmarks. The observed results suggest that Ontario can effectively select plans composed of subqueries that can be efficiently executed against heterogeneous data sources in a data lake.

Kemele M. Endris, Philipp D. Rohde, Maria-Esther Vidal, Sören Auer
A Model-Driven Framework for the Modeling and the Description of Data-as-a-Service to Assist Service Selection and Composition

Data as a Service (DaaS) is seen as a promising cloud offering for wrangling the overload of information and making it available across cloud platforms anytime and anywhere. While there are a large number of DaaS providers in the market, each one describes its services and supplied datasets differently. The lack of a well-defined, machine-readable model strongly hinders the automatic selection and composition of DaaS. This paper presents MoDaaS, a model-driven framework for modeling and describing DaaS services. MoDaaS enables DaaS providers to describe their service capabilities and concerns according to a shared ontology, and then to automatically generate service views in order to assist the integration of, and data exchange between, heterogeneous services.

Hiba Alili, Rim Drira, Khalid Belhajjame, Henda Hajjami Ben Ghezala, Daniela Grigori
Named Entity Recognition in Local Intent Web Search Queries

Semantic understanding of web queries is a challenging problem, as web queries are short, noisy and usually do not observe the grammar of a written language. In this paper, we specifically study user web search queries with local intent on Bing. Local-intent queries deal with searching for local businesses and services in a location; hence, local query parsing translates into the classical NLP problem of Named Entity Recognition (NER). State-of-the-art NER systems rely heavily on hand-crafted features and domain-specific knowledge to learn effectively from the small supervised training corpora that are available. In this paper, we use a deep neural model that relies solely on features extracted from word embeddings learnt in an unsupervised way using search logs. We propose a novel technique for generating domain-specific embeddings and show that they significantly improve the performance of existing models on the NER task. Our model outperforms the existing CRF-based parser currently used in production.

Saloni Mittal, Manoj K. Agarwal
Database Processing-in-Memory: A Vision

The recent trend of Processing-in-Memory (PIM) promises to tackle the memory and energy wall problems lurking in the data movement around the memory hierarchy, as in data analysis applications. In this paper, we present our vision of how database systems can embrace PIM in query processing. We share with the community an empirical analysis of the pros and cons of PIM in three main query operators to support our vision. We also present promising results from our ongoing work to build a PIM-aware query scheduler, which improved query execution by almost 3x and reduced energy consumption by at least 25%. We complete our discussion with challenges and opportunities to foster research on the co-design of database systems and PIM.

Tiago R. Kepe, Eduardo C. Almeida, Marco A. Z. Alves, Jorge A. Meira
Context-Aware GANs for Image Generation from Multimodal Queries

In this paper, we propose a novel model of context-aware generative adversarial networks (GANs) to generate images from a multimodal query: a pair consisting of a condition text and a context image. In our study, context is defined as the objects and concepts that appear in the image but not in the text. We construct two object trees expressing the objects and the corresponding hierarchical relationships described in the input condition text and context image, respectively, and compare these two trees to extract the context. Then, based on the extracted context, we generate parameters for the generator in context-aware GANs. To guarantee that the generated image is related to the multimodal query, i.e., both the condition text and the context image, we also construct a context discriminator in addition to the condition discriminator, similar to that of conditional GANs. The experimental results reveal that the proposed model generates images with higher resolution, containing more contextual information than previous models.

Kenki Nakamura, Qiang Ma
Backmatter
Metadata
Title
Database and Expert Systems Applications
Edited by
Prof. Dr. Sven Hartmann
Josef Küng
Sharma Chakravarthy
Prof. Dr. Gabriele Anderst-Kotsis
A Min Tjoa
Ismail Khalil
Copyright Year
2019
Electronic ISBN
978-3-030-27615-7
Print ISBN
978-3-030-27614-0
DOI
https://doi.org/10.1007/978-3-030-27615-7
