About this book

This book constitutes revised selected papers from the International Workshop on Data Quality and Trust in Big Data, QUAT 2018, which was held in conjunction with the International Conference on Web Information Systems Engineering, WISE 2018, in Dubai, UAE, in November 2018.

The 9 papers presented in this volume were carefully reviewed and selected from 15 submissions. They deal with novel ideas and solutions related to the problems of exploring, assessing, monitoring, improving, and maintaining the quality of data and trust for Big Data.

Table of Contents


A Novel Data Quality Metric for Minimality

The development of well-founded metrics to measure data quality is essential for estimating the significance of data-driven decisions, which are, among other things, the basis for artificial intelligence applications. While the majority of data quality research refers to the data values of an information system, less research is concerned with schema quality. However, a poorly designed schema negatively impacts the quality of the data; for example, redundancies at the schema level lead to inconsistencies and anomalies at the data level. In this paper, we propose a new metric to measure the minimality of a schema, an important indicator for detecting redundancies. We compare it to other minimality metrics and show that it is the only one that fulfills all requirements for a sound data quality metric. In our ongoing research, we are evaluating the benefits of the metric in more detail and investigating its applicability for redundancy detection in data values.
Lisa Ehrlinger, Wolfram Wöß
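The idea of minimality as a redundancy indicator can be illustrated with a toy score. This is not the metric proposed in the paper (which is defined there in full); it is merely a hedged sketch of the general notion that a schema is minimal when none of its elements is redundant:

```python
def minimality(total_elements: int, redundant_elements: int) -> float:
    """Illustrative minimality score: the fraction of schema elements
    that are not redundant. 1.0 means the schema is fully minimal;
    lower values indicate schema-level redundancy."""
    if total_elements <= 0:
        raise ValueError("schema must have at least one element")
    if redundant_elements > total_elements:
        raise ValueError("redundant elements cannot exceed total elements")
    return 1.0 - redundant_elements / total_elements
```

For instance, a schema with 10 attributes, 2 of which duplicate information stored elsewhere, would score 0.8 under this toy definition.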

Automated Schema Quality Measurement in Large-Scale Information Systems

Assessing the quality of information system schemas is crucial, because an unoptimized or erroneous schema design has a strong impact on the quality of the stored data; for example, it may lead to inconsistencies and anomalies at the data level. Even if the initial schema had an ideal design, changes during the life cycle can negatively affect the schema quality and have to be tackled. Especially in Big Data environments, there are two major challenges: large schemas, for which manual verification of schema and data quality is very arduous, and the integration of heterogeneous schemas from different data models, whose quality cannot be compared directly. Thus, we present a domain-independent approach for automatically measuring the quality of large and heterogeneous (logical) schemas. In contrast to existing approaches, we provide a fully automatable workflow that also enables regular reassessment. Our implementation measures the quality dimensions correctness, completeness, pertinence, minimality, readability, and normalization.
Lisa Ehrlinger, Wolfram Wöß

Email Importance Evaluation in Mailing List Discussions

Nowadays, mailing lists are widely used in team work for discussion and consultation. Identifying important emails in mailing list discussions could significantly benefit content summarization and opinion leader recognition. However, previous studies focus only on importance evaluation methods for personal emails, and there is no consensus on the definition of important emails. Therefore, in this paper we consider the characteristics of mailing lists and study how to evaluate email importance in mailing list discussions. Our contribution mainly includes the following aspects. First, we propose ER-Match, an email conversation thread reconstruction algorithm that takes nested quotation relationships into consideration while constructing the email relationship network. Based on this network, we formulate the importance of emails in mailing list discussions. Second, we propose a feature-rich learning method to predict the importance of new emails. Furthermore, we characterize various factors affecting email importance in mailing list discussions. Experiments with publicly available mailing lists show that our prediction model outperforms the baselines by a large margin.
Kun Jiang, Chunming Hu, Jie Sun, Qi Shen, Xiaohan Jiang
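The thread-reconstruction idea can be sketched in miniature. The toy below builds a reply network from parent links and scores a message by how much discussion it triggers; it is a hypothetical simplification, not ER-Match, which additionally matches nested quotations:

```python
from collections import defaultdict

def build_thread(emails):
    """emails: list of (msg_id, parent_id_or_None) pairs.
    Returns the child-adjacency map of the reply network."""
    children = defaultdict(list)
    for msg_id, parent in emails:
        if parent is not None:
            children[parent].append(msg_id)
    return children

def importance(children, msg_id):
    """Toy importance score: the number of direct and transitive
    replies a message attracted in the discussion."""
    total = 0
    stack = list(children.get(msg_id, []))
    while stack:
        m = stack.pop()
        total += 1
        stack.extend(children.get(m, []))
    return total
```

Under this sketch, a message that opens a thread with three descendant replies scores 3, while a leaf message scores 0.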

SETTRUST: Social Exchange Theory Based Context-Aware Trust Prediction in Online Social Networks

Trust is context-dependent. In real-world scenarios, people trust each other only in certain contexts. However, this concept has not been seriously taken into account in most of the existing trust prediction approaches in Online Social Networks (OSNs). In addition, very few attempts have been made at trust prediction based on social psychology theories. For decades, social psychology theories have attempted to explain people's behaviors in social networks; hence, employing such theories for trust prediction in OSNs will enhance accuracy. In this paper, we apply a well-known psychology theory, Social Exchange Theory (SET), to evaluate potential trust relations between users in OSNs. According to SET, one person starts a relationship with another person if and only if the costs of that relationship are less than its benefits. To evaluate potential trust relations in OSNs based on SET, we first propose factors that capture the costs and benefits of a relationship. Based on these factors, we then propose a trust metric called Trust Degree and a trust prediction method based on Matrix Factorization that incorporates the context of trust into a mathematical model. Finally, we conduct experiments on two real-world datasets to demonstrate the superior performance of our approach over the state-of-the-art approaches.
Seyed Mohssen Ghafari, Shahpar Yakhchi, Amin Beheshti, Mehmet Orgun
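The SET cost–benefit rule lends itself to a small sketch. The factor names and equal-weight defaults below are illustrative assumptions, not the paper's Trust Degree formulation:

```python
def trust_degree(benefits, costs, weights_b=None, weights_c=None):
    """Toy SET-style score: weighted sum of relationship benefits
    minus weighted sum of its costs. Positive means the benefits
    outweigh the costs."""
    wb = weights_b or [1.0] * len(benefits)
    wc = weights_c or [1.0] * len(costs)
    b = sum(w * v for w, v in zip(wb, benefits))
    c = sum(w * v for w, v in zip(wc, costs))
    return b - c

def would_connect(benefits, costs):
    """SET's rule of thumb: start the relationship only if the
    costs are less than the benefits."""
    return trust_degree(benefits, costs) > 0
```

For example, a user whose estimated benefit factors sum to 1.4 against costs of 0.3 would be predicted to form the relationship, while the reverse balance would not.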

CNR: Cross-network Recommendation Embedding User’s Personality

With the explosive growth of available data, recommender systems have become an essential tool for easing users' decision-making procedures. One of the most challenging problems in these systems is data sparsity, i.e., the lack of a sufficient amount of user interaction data. Recently, cross-network recommender systems, which integrate users' activities from multiple domains, were presented as a successful solution to this problem. However, most existing approaches use users' past behaviour to discover their preferences and then suggest similar items in the future. Their performance may therefore be limited, because they ignore recommendation diversity. Users are more willing to be recommended a diverse set of items rather than items similar to those they preferred before, so diversity plays a crucial role in evaluating recommendation quality. For instance, users who usually watch comedy movies may never be recommended a thriller, leading to redundant recommendations and decreased user satisfaction. In this paper, we exploit the user's personality type and incorporate it as a primary and enduring domain-independent factor that has a strong correlation with the user's preferences. We present a novel technique and an algorithm to capture users' personality types implicitly, without explicit user feedback (e.g., filling in questionnaires). We integrate this factor into a matrix factorization model and demonstrate the effectiveness of our approach using a real-world dataset.
Shahpar Yakhchi, Seyed Mohssen Ghafari, Amin Beheshti
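The matrix factorization backbone that such a personality factor would plug into can be sketched in plain Python. This is a generic SGD factorization, not the paper's CNR model; the comment marks where a personality-derived term could be added as an assumption:

```python
import random

def matrix_factorize(ratings, n_users, n_items, k=2, lr=0.05,
                     steps=1000, seed=0):
    """Plain matrix factorization by stochastic gradient descent on
    observed (user, item, rating) triples. A personality-derived bias
    or latent factor could be added per user as an extra term."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * err * qi   # move user factor toward target
                Q[i][f] += lr * err * pu   # move item factor toward target
    return P, Q
```

After training, the dot product of a user row of P and an item row of Q approximates the observed rating and extrapolates to unobserved pairs.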

Firefly Algorithm with Proportional Adjustment Strategy

The firefly algorithm is a recent heuristic intelligent optimization algorithm with excellent performance on many optimization problems. However, on some multimodal and high-dimensional problems, the algorithm easily falls into local optima. To avoid this, this paper proposes an improved firefly algorithm with a proportional adjustment strategy for alpha and beta. Thirteen well-known benchmark functions are used to verify the performance of the proposed algorithm; the computational results show that it is more efficient than many other FA variants.
Jing Wang, Guiyuan Liu, William Wei Song
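The baseline firefly algorithm the paper improves on can be sketched as follows. The `decay` parameter below proportionally shrinks the randomization step alpha each iteration; this is only a stand-in illustration, since the paper's exact proportional adjustment of alpha and beta is defined there:

```python
import math
import random

def firefly_minimize(f, dim, n=15, iters=150, alpha=0.25, beta0=1.0,
                     gamma=1.0, decay=0.97, bounds=(-5.0, 5.0), seed=1):
    """Minimal firefly algorithm for minimizing f over a box.
    Brighter fireflies (lower cost) attract dimmer ones; attraction
    decays with squared distance, and alpha adds random exploration."""
    rng = random.Random(seed)
    lo, hi = bounds
    X = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n)]
    light = [f(x) for x in X]
    for _ in range(iters):
        for i in range(n):
            for j in range(n):
                if light[j] < light[i]:  # j is brighter (lower cost)
                    r2 = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
                    beta = beta0 * math.exp(-gamma * r2)
                    X[i] = [min(hi, max(lo, xi + beta * (xj - xi)
                                        + alpha * (rng.random() - 0.5)))
                            for xi, xj in zip(X[i], X[j])]
                    light[i] = f(X[i])
        alpha *= decay  # proportionally shrink the random step
    best = min(range(n), key=lambda i: light[i])
    return X[best], light[best]
```

On a simple convex test function such as the sphere function, the swarm contracts toward the optimum as alpha decays.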

A Formal Taxonomy of Temporal Data Defects

Data quality assessment outcomes are essential for the reliability of analytical processes, especially when temporal data is involved. Such outcomes depend on the efficiency and efficacy of (semi-)automated approaches, which in turn are determined by an understanding of the problem associated with each data defect. The few works that describe temporal data defects with regard to accuracy, completeness, and consistency show significant heterogeneity in terminology, nomenclature, depth of description, and number of examined defects. To close this gap, this work presents a taxonomy that organizes temporal data defects according to a five-step methodology. The proposed taxonomy enhances the descriptions and coverage of defects compared with related works, and also supports certain requirements of data quality assessment, including the design of visual analytics solutions.
João Marcelo Borovina Josko

Data-Intensive Computing Acceleration with Python in Xilinx FPGA

Data-intensive workloads drive the development of hardware design. Such data-intensive services are driven by the rising trend of novel machine learning techniques, such as CNNs and RNNs, applied over massive chunks of data objects. These services require novel devices with configurable, high-throughput I/O (e.g., for data-based model training) and large computation capability (e.g., for large numbers of convolutional operations). In this paper, we present our early work on a Python-based Field-Programmable Gate Array (FPGA) system to support such data-intensive services. In our current system, we deploy a light layer of CNN optimization and a mixed hardware setup, including multiple FPGA/GPU nodes, to provide performance acceleration at runtime. Our prototype supports popular machine learning platforms such as Caffe. Our initial empirical results show that the system handles data-intensive learning services efficiently.
Yalin Yang, Linjie Xu, Zichen Xu, Yuhao Wang
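The convolutional operation such a system offloads to FPGA fabric can be shown in its simplest software form. This naive reference implementation (not the authors' accelerated kernel) makes clear why the workload benefits from hardware parallelism: it is four nested multiply-accumulate loops:

```python
def conv2d(image, kernel):
    """Naive valid-mode 2-D convolution (strictly, cross-correlation),
    the core multiply-accumulate operation a CNN accelerator offloads
    to FPGA fabric. image and kernel are lists of rows."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for y in range(oh):
        for x in range(ow):
            acc = 0.0
            for dy in range(kh):
                for dx in range(kw):
                    acc += image[y + dy][x + dx] * kernel[dy][dx]
            out[y][x] = acc
    return out
```

Each output element is an independent dot product, which is exactly the kind of regular, parallel arithmetic that maps well onto configurable logic.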

Delone and McLean IS Success Model for Evaluating Knowledge Sharing

It is generally agreed that Knowledge Sharing (KS) is an effective process within organizational settings, and it is the cornerstone of many firms' Knowledge Management (KM) strategies. Despite the growing significance of KS for an organization's competitiveness and performance, the difficulty of analyzing the level of KS makes it hard for KM to reach the optimum level of KS. For these reasons, this study develops a conceptual model based on an IS theory well suited to evaluating the level of KS: the DeLone and McLean IS Success model. This model, grounded in communication theory, covers various perspectives for assessing Information Systems (IS); its dimensions make it a multidimensional measurement model suitable for determining the level of KS.
Azadeh Sarkheyli, William Wei Song
