2019 | Book

Linking and Mining Heterogeneous and Multi-view Data

About this book

This book highlights research in linking and mining data from across varied data sources. The authors focus on recent advances in this burgeoning field of multi-source data fusion, with an emphasis on exploratory and unsupervised data analysis, an area of increasing significance as the growth of data vastly outpaces any chance of labeling it manually. The book looks at the underlying algorithms and technologies that facilitate the area within big data analytics, and it covers their applications across domains such as smarter transportation, social media, fake news detection and enterprise search, among others. This book enables readers to understand a spectrum of advances in this emerging area, and it will hopefully empower them to leverage and develop methods in multi-source data fusion and analytics with applications to a variety of scenarios.

Includes advances on unsupervised, semi-supervised and supervised approaches to heterogeneous data linkage and fusion; Covers use cases of analytics over multi-view and heterogeneous data from across a variety of domains such as fake news, smarter transportation and social media, among others;

Provides a high-level overview of advances in this emerging field and empowers the reader to explore novel applications and methodologies that would enrich the field.

Table of Contents

Frontmatter
Chapter 1. Multi-View Data Completion
Abstract
Multi-view learning has been explored in various applications such as bioinformatics, natural language processing and multimedia analysis. Multi-view learning methods commonly assume that full feature matrices or kernel matrices for all views are available. However, in partial data analytics, it is common that information from some sources is not available or missing for some data-points. Such lack of information can be categorized into two types. (1) Incomplete view: information of a data-point is partially missing in some views. (2) Missing view: information of a data-point is entirely missing in some views, but information for that data-point is fully available in other views (no partially missing data-point in a view).
Although multi-view learning in the presence of missing data has drawn a great amount of attention in the recent past and there are quite a lot of research papers on multi-view data completion, there is no comprehensive introduction and review of current approaches to multi-view data completion. We address this gap in this chapter by describing the multi-view data completion methods.
In this chapter, we will mainly discuss existing methods to deal with the missing view problem. We describe a simple taxonomy of the current approaches, and for each category, representative as well as newly proposed models are presented. We also attempt to identify promising avenues and point out some specific challenges which can hopefully promote further research in this rapidly developing field.
Sahely Bhadra
Chapter 2. Multi-View Clustering
Abstract
With a plethora of data capturing modalities becoming available, the same data object often leaves different kinds of digital footprints. This naturally leads to datasets comprising the same set of data objects represented in different forms, called multi-view data. Among the most fundamental tasks in unsupervised learning is that of clustering, the task of grouping data objects into groups of related objects. Multi-view clustering (MVC) is a flourishing field in unsupervised learning; the MVC task considers leveraging multiple views of data objects in order to arrive at a more effective and accurate grouping than what can be achieved by just using one view of data. Multi-view clustering methods differ in the kind of modelling they use in order to fuse multiple views, by managing the synergies, complementarities, and conflicts across data views, and arriving at a single clustering output across the multiple views in the dataset. This chapter provides a survey of a sample of multi-view clustering methods, with an emphasis on bringing out the wide diversity in solution formulations that have been considered. We pay specific attention to enabling the reader to understand the intuition behind each method ahead of describing the technical details of the method, to ensure that the survey is accessible to readers who may not be machine learning specialists. We also outline some popular datasets that have been used to empirically evaluate MVC methods.
Deepak P, Anna Jurek-Loughrey
Chapter 3. Semi-supervised and Unsupervised Approaches to Record Pairs Classification in Multi-Source Data Linkage
Abstract
Data integration has become one of the main challenges in the era of Big Data analytics. Often, to enable decision-making, data from different sources have to be integrated and linked together. For example, multi-source data integration is vital to police, counter-terrorism and national security to allow efficient and accurate verification of people. One of the key challenges in the data integration process is matching records that represent the same real-world entity (e.g. person). This process is referred to as record linkage. In many cases, data sets do not share a unique identifier (e.g. National Insurance Number), hence records need to be matched by comparing their corresponding attributes. Most of the existing record linkage methods require assistance from a domain expert for handcrafting domain-specific linking rules. More automatic approaches, based on machine learning, have also been proposed. However, those approaches rely on having a substantial set of manually labelled records, which makes them inapplicable in real-world scenarios. Given the importance of the problem, record linkage has witnessed strong interest in the past decade. As a result, significant progress has been made in this area. In particular, the problem of reducing the manual effort and the amount of labelled data required for constructing record linkage models has been addressed in many studies. In this chapter, we review the most recently proposed approaches to semi-supervised and unsupervised record linkage.
Anna Jurek-Loughrey, Deepak P
Chapter 4. A Review of Unsupervised and Semi-supervised Blocking Methods for Record Linkage
Abstract
Record linkage, also referred to as entity resolution, is the process of identifying records representing the same real-world entity (e.g. a person) across varied data sources. To reduce the computational complexity associated with record comparisons, a task referred to as blocking is commonly performed prior to the linkage process. The blocking task involves partitioning records into blocks of records and treating records from different blocks as not related to the same entity. Following this, record linkage methods are applied within each block, significantly reducing the number of record comparisons. Most of the existing blocking techniques require some degree of parameter selection in order to optimise the performance for a particular dataset (e.g. attributes and blocking functions used for splitting records into blocks). Optimal parameters can be selected manually, but this is expensive in terms of time and cost and assumes a domain expert to be available. Automatic supervised blocking techniques have been proposed; however, they require a set of labelled data in which the matching status of each record is known. In the majority of real-world scenarios, we do not have any information regarding the matching status of records obtained from multiple sources. Therefore, there is a demand for blocking techniques that sufficiently reduce the number of record comparisons with little to no human input or labelled data required. Given the importance of the problem, recent research efforts have seen the development of novel unsupervised and semi-supervised blocking techniques. In this chapter, we review existing blocking techniques and discuss their advantages and disadvantages. We detail other research areas that have recently arisen and discuss other unresolved issues that are still to be addressed.
Kevin O’Hare, Anna Jurek-Loughrey, Cassio de Campos
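The blocking idea summarised in the Chapter 4 abstract can be illustrated with a minimal sketch. It is not taken from the chapter: the blocking key (first letter of the surname plus birth year), the record fields and the sample records are all hypothetical, chosen only to show how comparing records within blocks reduces the number of candidate pairs.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking function: first letter of surname plus birth year.
    return (record["surname"][:1].lower(), record["birth_year"])


def build_blocks(records):
    # Partition records into blocks; records in different blocks are never compared.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    return blocks


def candidate_pairs(blocks):
    # Only pairs inside the same block become candidates for linkage.
    for members in blocks.values():
        yield from combinations(members, 2)


# Made-up sample records, for illustration only.
records = [
    {"id": 1, "surname": "Smith", "birth_year": 1980},
    {"id": 2, "surname": "Smyth", "birth_year": 1980},
    {"id": 3, "surname": "Jones", "birth_year": 1975},
    {"id": 4, "surname": "Johnson", "birth_year": 1975},
    {"id": 5, "surname": "Brown", "birth_year": 1990},
]

blocks = build_blocks(records)
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(blocks)]
all_pairs = len(records) * (len(records) - 1) // 2

print(f"pairs without blocking: {all_pairs}")
print(f"pairs with blocking: {len(pairs)}")
```

The choice of key is exactly the kind of parameter selection the abstract mentions: a key that is too strict can separate true matches into different blocks, while one that is too loose saves few comparisons.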
Chapter 5. Traffic Sensing and Assessing in Digital Transportation Systems
Abstract
By integrating relevant vision technologies, based on multiview data and parsimonious models, into the transportation system’s infrastructure and in vehicles themselves, the main transportation problems can be alleviated and road safety improved along with an increase in economic productivity. This new cooperative environment, which integrates networking, electronic, and computing technologies, will enable safer roads, achieve more efficient mobility, and minimize the environmental impact. It is within this context of digital transportation systems that this chapter attempts to review the main concepts of intelligent road traffic management. We begin by summarizing the best-known vehicle recording and counting devices and the major interrelated transportation problems, especially congestion and pollution. The main physical variables governing urban traffic and the factors responsible for transportation problems, as well as the common assessment methodologies, are overviewed. Graphics and real-life shots are occasionally used to clearly depict the reported concepts. Then, in direct relation to the recent literature on surveillance based on computer vision and image processing, the most efficient counting techniques published over the last few years are reviewed and commented on. Their few drawbacks are underlined and the prospects for improvement are briefly expressed. This chapter could be used not only as a pedagogical guide, but also as a practical reference which explains the efficient implementation of traffic management systems in new smart cities.
Hana Rabbouch, Foued Saâdaoui, Rafaa Mraihi
Chapter 6. How Did the Discussion Go: Discourse Act Classification in Social Media Conversations
Abstract
Over the last two decades, social media has emerged as almost an alternate world where people communicate with each other and express opinions about almost anything. This makes platforms like Facebook, Reddit, Twitter, Myspace, etc., a rich bank of heterogeneous data, primarily expressed via text but reflecting all textual and non-textual data that human interaction can produce. We propose a novel attention-based hierarchical LSTM model to classify discourse act sequences in social media conversations, aimed at mining data from online discussion using textual meanings beyond sentence level. The very uniqueness of the task is the complete categorization of possible pragmatic roles in informal textual discussions, contrary to extraction of question–answers, stance detection, or sarcasm identification, which are very much role-specific tasks. An early attempt was made on a Reddit discussion dataset. We train our model on the same data, and present test results on two different datasets, one from Reddit and one from Facebook. Our proposed model outperformed the previous one in terms of domain independence; without using platform-dependent structural features, our hierarchical LSTM with word relevance attention mechanism achieved F1-scores of 71% and 66%, respectively, to predict discourse roles of comments in Reddit and Facebook discussions. The efficiency of recurrent and convolutional architectures in learning discursive representations on the same task is presented and analyzed, with different word and comment embedding schemes. Our attention mechanism enables us to inquire into relevance ordering of text segments according to their roles in discourse. We present a human annotator experiment to unveil important observations about modeling and data annotation. Equipped with our text-based discourse identification model, we inquire into how heterogeneous non-textual features like location, time, leaning of information, etc. play their roles in characterizing online discussions on Facebook.
Subhabrata Dutta, Tanmoy Chakraborty, Dipankar Das
Chapter 7. Learning from Imbalanced Datasets with Cross-View Cooperation-Based Ensemble Methods
Abstract
In this paper, we address the problem of learning from imbalanced multi-class datasets in a supervised setting when multiple descriptions of the data (also called views) are available. Each view incorporates various information on the examples, and in particular, depending on the task at hand, each view might be better at recognizing only a subset of the classes. Establishing a sort of cooperation between the views is needed for all the classes to be equally recognized, a crucial problem particularly for imbalanced datasets. The novelty of our work consists in capitalizing on the complementariness of the views so that each class can be processed by the most appropriate view(s), thus improving the per-class performances of the final classifier. The main contributions of this paper are two ensemble learning methods based on recent theoretical works on the use of the confusion matrix’s norm as an error measure, while empirical results show the benefits of the proposed approaches.
Cécile Capponi, Sokol Koço
Chapter 8. Entity Linking in Enterprise Search: Combining Textual and Structural Information
Abstract
Fast and correct identification of named entities in queries is crucial for query understanding and for mapping the query to information in a structured knowledge base. Most existing works have focused on utilizing search logs and manually curated knowledge bases for entity linking; they often involve complex graph operations and are generally slow. We describe a simple, yet fast and accurate, probabilistic entity linking algorithm that can be used in enterprise settings where automatically constructed, domain-specific knowledge graphs are used. In addition to the linked graph structure, textual evidence from the domain-specific corpus is also utilized to improve the performance.
Sumit Bhatia
Chapter 9. Clustering Multi-View Data Using Non-negative Matrix Factorization and Manifold Learning for Effective Understanding: A Survey Paper
Abstract
Multi-view data, in which the data is represented by many types of features, has received much attention recently. The class of methods utilizing non-negative matrix factorization (NMF) and manifold learning to seek the meaningful latent structure of data has been widely used for both traditional data and multi-view data. The NMF and manifold-based multi-view clustering methods focus on dealing with the challenges of manifold learning and applying manifold learning on the NMF framework. This paper provides a comprehensive review of this important class of methods on multi-view data. We conduct an extensive experiment on several datasets and raise many open problems that can be dealt with in the future so that a higher clustering performance can be achieved.
Khanh Luong, Richi Nayak
Chapter 10. Leveraging Heterogeneous Data for Fake News Detection
Abstract
Nowadays, plenty of social media platforms are available to exchange information rapidly. Such rapid propagation and accumulation of information forms a deluge, in which it is hard to know which pieces of information to believe, since much of it appears to be very realistic. In this context, characterizing and recognizing misinformation, especially fake news, is a much-needed computational task. News fabrication mostly happens through the textual and visual content of the news article. People spreading fake news intentionally modify the content of a news item with some partially true information, or use fully manipulated information, newly fabricated stories, etc., which could mislead others. Fake news characterization and detection are computational studies that aim to curb the creation and propagation of highly malicious news. Methods to identify fake news generation and spread use textual and visual content-related features and the temporal and propagation patterns of the network, with traditional and deep neural computations. This chapter discusses the methods to leverage heterogeneous data to curb fake news generation and propagation. We present an extensive review of the state-of-the-art fake news detection systems in the context of different modalities, emphasizing the content-based approaches including text and image modality, and also briefly discuss the network, temporal, and knowledge base approaches. This study also discusses the available datasets in this area, the open issues, and future directions of research.
K. Anoop, Manjary P. Gangan, Deepak P, V. L. Lajish
Chapter 11. General Framework for Multi-View Metric Learning
Abstract
We consider the problem of metric learning for multi-view data and present a general method for learning within-view as well as between-view metrics in vector-valued kernel spaces, as a way to capture multimodal structure of the data. We formulate a general convex optimization problem in this context to jointly learn the metric and the classifier or regressor in kernel feature spaces. The formulated multi-view metric learning (MVML) can be applied to data with any number of views, not just two, while as a kernel-based method it allows for various data types. Indeed, it is not required for the views to have the same data type, as long as all of them are individually kernelizable. We give concrete realizations of our iterative algorithm in both classification and regression settings, where the metric operating between views is also learned, either a full metric or a view-sparse one. In order to scale the computation to large training sets, a block-wise Nyström approximation of the multi-view kernel matrix is introduced. We justify our approach theoretically and experimentally, and show its performance on real-world datasets against relevant state-of-the-art methods.
Riikka Huusari, Hachem Kadri, Cécile Capponi
Chapter 12. On the Evaluation of Community Detection Algorithms on Heterogeneous Social Media Data
Abstract
One fundamental problem in social networks is the identification of groups of elements (also known as communities) when group membership is not explicitly available. Community detection has proven to be valuable in diverse domains such as biology, social sciences and bibliometrics. Thus, several community detection techniques have been developed. Nonetheless, as real networks are very heterogeneous, the question of how communities should be assessed remains open. Whilst there are several works that have analysed the performance of diverse community detection algorithms over artificial graph benchmarks, the evaluation over real social networks has received comparatively less attention. Motivated by the lack of such studies, this chapter focuses on the analysis of the performance of community detection algorithms over social media networks, and the quantification of the structural properties of the discovered communities.
Antonela Tommasel, Daniela Godoy
Backmatter
Metadata
Title
Linking and Mining Heterogeneous and Multi-view Data
Editors
Dr. Deepak P
Dr. Anna Jurek-Loughrey
Copyright Year
2019
Electronic ISBN
978-3-030-01872-6
Print ISBN
978-3-030-01871-9
DOI
https://doi.org/10.1007/978-3-030-01872-6