2022 | Book

# Knowledge Discovery from Multi-Sourced Data

Authors: Dr. Chen Ye, Prof. Hongzhi Wang, Prof. Guojun Dai

Publisher: Springer Nature Singapore

Book Series: SpringerBriefs in Computer Science


This book addresses several knowledge discovery problems on multi-sourced data, combining theories, techniques, and methods from data cleaning, data mining, and natural language processing. It focuses on three data models: multi-sourced isomorphic data, multi-sourced heterogeneous data, and text data. On the basis of these three data models, the book studies knowledge discovery problems, including truth discovery and fact discovery, with respect to four important properties of multi-sourced data: relevance, inconsistency, sparseness, and heterogeneity. It is intended for specialists as well as graduate students.

Data describing the same object or event can come from a variety of sources, such as crowd workers and social media users, and noisy data is unavoidable. Given the daunting scale of the data, it is unrealistic to expect humans to label it or to tell which data source is more reliable. Hence, it is crucial to identify trustworthy information from multiple noisy information sources, which is the task of knowledge discovery. At present, knowledge discovery research for multi-sourced data faces two main challenges. On the structural level, it is essential to consider the different characteristics of data composition and application scenarios and to define the knowledge discovery problem for each setting. On the algorithmic level, the knowledge discovery task needs to account for different levels of information conflict and to design efficient algorithms that mine more valuable information using multiple clues. Existing knowledge discovery methods have shortcomings on both levels, leaving the knowledge discovery problem far from solved.

##### Chapter 1. Introduction
Abstract
In the age of information explosion, data has penetrated every aspect of our lives. Different data sources, such as social networks, sensing devices, and crowdsourcing platforms, constantly generate data, and even a single object may be described by many different sources. Intuitively, analyzing this multi-source data yields valuable information. At the individual level, enterprises can recommend targeted products by analyzing the comments of their target customers on multiple platforms. At the group level, by analyzing the characteristics of massive amounts of multi-source data, government departments can make well-founded policy decisions, and researchers can arrive at novel findings. As a result, intelligent decision-making built around multi-source data is gradually replacing traditional manual decision-making. This chapter discusses the background of knowledge discovery from multi-source data. In Sect. 1.1, we analyze the quality of multi-source data to motivate the need to discover useful information from noisy sources. In Sect. 1.2, we summarize existing studies and examine their drawbacks. We conclude the chapter with an overview of the structure of this book in Sect. 1.3.
Chen Ye, Hongzhi Wang, Guojun Dai
##### Chapter 2. Functional-Dependency-Based Truth Discovery for Isomorphic Data
Abstract
Errors in databases are unavoidable. Causes include recording errors, stale data, and even intentional errors, and such mistakes may have serious consequences. It is impossible to correct these errors manually at scale; in fact, it is hard for people even to detect them. However, since errors often occur more or less randomly, they may cause inconsistencies within a database and conflicts among multiple databases from different sources. These inconsistencies and conflicts are easy to detect but hard to repair. In this chapter, we first discuss two lines of work that deal with these inconsistencies and conflicts, namely data repairing and truth discovery. We then introduce the idea of conducting functional-dependency-based truth discovery over multi-source data [1], which takes advantage of both data repairing and truth discovery. Specifically, Sect. 2.1 discusses how existing methods resolve conflicts and inconsistencies and then motivates our approach. Section 2.2 defines the functional-dependency-based truth discovery problem, i.e., the multi-source data repairing problem. Section 2.3 describes the overall framework and the details of each of its components, followed by a brief summary in Sect. 2.4.
Chen Ye, Hongzhi Wang, Guojun Dai
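
The Chapter 2 abstract notes that inconsistencies are easy to detect but hard to repair. As a minimal illustration of the detection half only, the sketch below checks a single functional dependency (`zip -> city`) over records pooled from several sources. The function name `fd_violations`, the attribute names, and the sample rows are all hypothetical and are not taken from the book; the repair step the chapter is actually about is not shown.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Find violations of the functional dependency lhs -> rhs.

    Returns a dict mapping each left-hand-side value (as a tuple) to the
    set of right-hand-side values seen with it, restricted to keys that
    map to more than one value (i.e. the violating groups).
    """
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        seen[key].add(row[rhs])
    return {key: values for key, values in seen.items() if len(values) > 1}


# Hypothetical records about the same entities, pooled from three sources.
rows = [
    {"source": "s1", "zip": "10001", "city": "New York"},
    {"source": "s2", "zip": "10001", "city": "New York"},
    {"source": "s3", "zip": "10001", "city": "Newark"},
    {"source": "s1", "zip": "60601", "city": "Chicago"},
]

violations = fd_violations(rows, lhs=["zip"], rhs="city")
for key, values in violations.items():
    print(key, sorted(values))
```

Detection is a single pass over the data. Deciding which of the conflicting values to keep is the hard part, and that is where source reliability comes in.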
##### Chapter 3. Denial-Constraint-Based Truth Discovery for Isomorphic Data
Abstract
Aggregating accurate information from conflicting multi-source data is crucial. A common approach to this problem is voting or averaging. However, such methods often fail to produce correct results, since they assume that all sources are equally reliable. In most cases, information quality varies considerably across sources, because each source contains different levels of errors such as recording errors, outdated data, and even intentional errors. Based on this observation, a research topic named truth discovery has been proposed. Since relations among entities and attributes are common in real-world applications, in this chapter we introduce the constrained truth discovery problem [1]. We incorporate denial constraints, a universally quantified first-order logic formalism that can express a large number of effective and widely occurring relations among entities, into the process of truth discovery. Specifically, we give a motivating example and define the problem in Sects. 3.1 and 3.2, respectively. In Sect. 3.3, we investigate the constrained optimization problem and provide solutions to it. Finally, we conclude this chapter in Sect. 3.4.
Chen Ye, Hongzhi Wang, Guojun Dai
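
To make the contrast between voting and truth discovery in the Chapter 3 abstract concrete, here is a minimal sketch under stated assumptions. It compares plain majority voting with a generic iterative scheme that alternates between picking the weighted-vote winner and re-estimating source weights. This is a textbook-style baseline, not the denial-constraint method of the chapter; the weight formula (negative log of a smoothed error rate), the function names, and the toy claims are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def majority_vote(claims):
    """Pick the most frequently claimed value for each object."""
    return {obj: Counter(by_source.values()).most_common(1)[0][0]
            for obj, by_source in claims.items()}


def weighted_truth_discovery(claims, n_iter=10):
    """Alternate between estimating truths and source weights."""
    sources = {s for by_source in claims.values() for s in by_source}
    weights = {s: 1.0 for s in sources}  # start with equal trust
    truths = {}
    for _ in range(n_iter):
        # Step 1: per object, take the value with the largest total weight.
        for obj, by_source in claims.items():
            scores = defaultdict(float)
            for s, value in by_source.items():
                scores[value] += weights[s]
            truths[obj] = max(scores, key=scores.get)
        # Step 2: weight each source by -log of its smoothed error rate
        # (assumed formula; add-one smoothing keeps the log finite).
        for s in sources:
            made = [(obj, bs[s]) for obj, bs in claims.items() if s in bs]
            wrong = sum(1 for obj, value in made if value != truths[obj])
            weights[s] = -math.log((wrong + 1) / (len(made) + 2))
    return truths, weights


# Toy claims: sources A and B are careful; C, D, and E are not, but they
# happen to agree on one wrong value.
claims = {
    "capital_of_australia": {"A": "Canberra", "B": "Canberra",
                             "C": "Sydney", "D": "Sydney", "E": "Sydney"},
    "capital_of_canada": {"A": "Ottawa", "B": "Ottawa",
                          "C": "Toronto", "D": "Montreal", "E": "Vancouver"},
    "capital_of_brazil": {"A": "Brasilia", "B": "Brasilia",
                          "C": "Rio de Janeiro", "D": "Sao Paulo", "E": "Salvador"},
}

voted = majority_vote(claims)
truths, weights = weighted_truth_discovery(claims)
print(voted["capital_of_australia"], truths["capital_of_australia"])
```

In this toy data, majority voting is outvoted on the first object by three unreliable sources, while the iterative scheme down-weights them because they disagree with the emerging consensus on the other two objects.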
##### Chapter 4. Pattern Discovery for Heterogeneous Data
Abstract
In the field of knowledge discovery for multi-source homogeneous data, the correct value of an entity is found by resolving conflicts among multiple sources of information. However, due to missing values and inefficient entity matching, the information available for a single entity is often insufficient in practical applications. This calls for pattern discovery, which finds information shared across a collective set of entities and then uses the discovered patterns to identify the related truths. In this chapter, we introduce pattern discovery for truth discovery and formulate it as an optimization problem [1]. To solve this problem, we propose an algorithm called PatternFinder that jointly and iteratively learns the variables. We give a motivating example in Sect. 4.1 and define the problem of pattern discovery in Sect. 4.2. Section 4.3 describes the overall solution and its main component, PatternFinder. We conclude the chapter with final remarks in Sect. 4.4.
Chen Ye, Hongzhi Wang, Guojun Dai
##### Chapter 5. Fact Discovery for Text Data
Abstract
Fact extraction, which aims to extract (entity, attribute, value) tuples from massive text corpora, is crucial in text data mining. Recent approaches focus on extracting facts by mining textual patterns with semantic types, where the quality of a pattern is evaluated using content-based criteria such as frequency. However, these approaches overlook pattern reliability, which reflects how likely the extracted facts are to be correct. As a result, a pattern of good content quality (e.g., high frequency) may still extract incorrect facts. In this chapter, we consider both pattern reliability and fact trustworthiness in addressing the pattern-based fact extraction problem [1]. We give a motivating example and the problem definition in Sects. 5.1 and 5.2, respectively. We detail the CNN-LSTM model design and present the experimental results in Sect. 5.3. Finally, we conclude in Sect. 5.4.
Chen Ye, Hongzhi Wang, Guojun Dai