
2017 | Book

Handbook of Big Data Technologies


About this book

This handbook offers comprehensive coverage of recent advancements in Big Data technologies and related paradigms. Chapters are authored by international leading experts in the field, and have been reviewed and revised for maximum reader value. The volume consists of twenty-five chapters organized into four main parts. Part One covers the fundamental concepts of Big Data technologies, including data curation mechanisms, data models, storage models, programming models and programming platforms. It also dives into the details of implementing Big SQL query engines and big stream processing systems. Part Two focuses on the semantic aspects of Big Data management, including data integration and exploratory ad hoc analysis, in addition to structured querying and pattern matching techniques. Part Three presents a comprehensive overview of large-scale graph processing. It covers the most recent research in large-scale graph processing platforms, introducing several scalable graph querying and mining mechanisms in domains such as social networks. Part Four details novel applications that have been made possible by the rapid emergence of Big Data technologies, such as Internet of Things (IoT), Cognitive Computing and SCADA systems. All parts of the book discuss open research problems, including potential opportunities, that have arisen from the rapid progress of Big Data technologies and the associated increasing requirements of application domains.
Designed for researchers, IT professionals and graduate students, this book is a timely contribution to the growing Big Data field. Big Data has been recognized as one of the leading emerging technologies that will have a major impact on the various fields of science and on many aspects of human society over the coming decades. Therefore, the content in this book will be an essential tool to help readers understand the development and future of the field.

Table of Contents

Frontmatter

Fundamentals of Big Data Processing

Frontmatter
Big Data Storage and Data Models
Abstract
Data and storage models are the basis of big data ecosystem stacks. While a storage model captures the physical aspects of and features for storing data, a data model captures the logical representation and structures used for data processing and management. Understanding storage and data models together is essential for understanding the big data ecosystems built on them. In this chapter we investigate and compare the key storage and data models across the spectrum of big data frameworks.
Dongyao Wu, Sherif Sakr, Liming Zhu
Big Data Programming Models
Abstract
Big Data programming models represent the style of programming and present the interface paradigms for developers to write big data applications and programs. Programming models are normally the core feature of big data frameworks, as they implicitly affect the execution model of big data processing engines and also drive the way users express and construct big data applications and programs. In this chapter, we comprehensively investigate different programming models for big data frameworks, with comparisons and concrete code examples.
Dongyao Wu, Sherif Sakr, Liming Zhu
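As a rough illustration of what such a programming-model comparison involves, the sketch below expresses word count in the map/reduce functional style using plain Python. The helper names (map_fn, shuffle, reduce_fn) are our own and are not tied to any particular framework's API.

```python
from collections import defaultdict
from itertools import chain

# Framework-agnostic illustration of the MapReduce model: a map function
# emits (key, value) pairs, a shuffle groups them by key, and a reduce
# function folds each group into a final value.

def map_fn(line):
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    return key, sum(values)

def word_count(lines):
    mapped = chain.from_iterable(map_fn(line) for line in lines)
    return dict(reduce_fn(k, vs) for k, vs in shuffle(mapped).items())

print(word_count(["big data programming models", "big data frameworks"]))
```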
Programming Platforms for Big Data Analysis
Abstract
Big data analysis imposes new challenges and requirements on programming support. Programming platforms need to provide new abstractions and runtime techniques with key features such as scalability, fault tolerance, efficient task distribution, usability and processing speed. In this chapter, we first provide a comprehensive survey of the requirements and then give an overview and classification of existing big data programming platforms along different dimensions. Next, we present details of the architecture, methodology and features of major programming platforms such as MapReduce, Storm, Spark, Pregel and GraphLab. Finally, we compare existing big data platforms, discuss the need for a unifying framework, present our proposed framework MatrixMap, and give a vision of future work.
Jiannong Cao, Shailey Chawla, Yuqi Wang, Hanqing Wu
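For a concrete flavor of one of the surveyed platforms, here is a minimal PySpark word-count sketch. It assumes a local Spark installation; the input path is hypothetical.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "word-count-sketch")

counts = (sc.textFile("hdfs:///data/sample.txt")   # hypothetical input path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

for word, n in counts.take(10):
    print(word, n)

sc.stop()
```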
Big Data Analysis on Clouds
Abstract
The huge amount of data generated, the speed at which it is produced, and its heterogeneity in terms of format represent a challenge to current storage, processing and analysis capabilities. Those data volumes, commonly referred to as Big Data, can be exploited to extract useful information and to produce helpful knowledge for science, industry, public services and, in general, for humankind. Big Data analytics refers to advanced mining techniques applied to Big Data sets. In general, the process of knowledge discovery from Big Data is not straightforward, mainly due to data characteristics such as size, complexity and variety, which require several issues to be addressed. Cloud computing is a valid and cost-effective solution for supporting Big Data storage and for executing sophisticated data mining applications. Big Data analytics is a continuously growing field, so novel and efficient solutions (i.e., in terms of platforms, programming tools, frameworks, and data mining algorithms) spring up every day to cope with the growing scope of interest in Big Data. This chapter discusses models, technologies and research trends in Big Data analysis on Clouds. In particular, the chapter presents representative examples of Cloud environments that can be used to implement applications and frameworks for data analysis, and an overview of the leading software tools and technologies used for developing scalable data analysis on Clouds.
Loris Belcastro, Fabrizio Marozzo, Domenico Talia, Paolo Trunfio
Data Organization and Curation in Big Data
Abstract
This chapter covers advanced techniques in Big Data analytics and query processing. As data gets bigger and, at the same time, workloads and analytics get more complex, advances in big data applications are no longer hindered by the ability to collect or generate data, but instead by the ability to efficiently and effectively manage the available data. Therefore, numerous scalable and distributed infrastructures have been proposed to manage big data. However, it is well known in the literature that scalability and distributed processing alone are not enough to achieve high performance; instead, the underlying infrastructure has to be highly optimized for various types of workloads and query classes. These optimizations typically start from the lowest layer of the data management stack, which is the storage layer. In this chapter, we cover two well-known techniques for optimized storage and organization of data that have a big influence on query performance, namely indexing and data layout techniques. However, for non-traditional workloads where queries have special execution and data-access characteristics, the standard indexing and layout techniques may fall short of the desired performance goals, and further optimizations specific to the workload characteristics can be applied. We therefore also cover techniques addressing several of these non-traditional workloads in the context of big data. Some of these techniques rely on curating either the data or the workflows (or both) with useful metadata information; this curation information can be very valuable for both query optimization and the business logic. We additionally cover the curation and metadata management of big data in query optimization and in different systems. Throughout, we focus on MapReduce-like infrastructures, more specifically the open-source implementation Hadoop. The chapter covers the state of the art in big data indexing techniques, and the data layout and organization strategies used to speed up queries. It also covers advanced techniques for enabling non-traditional workloads in Hadoop. Hadoop is primarily designed for workloads that are batch, offline, ad hoc, and disk-based; yet, this chapter covers recent projects and techniques targeting non-traditional workloads such as continuous query evaluation, main-memory processing, and recurring workloads. In addition, the chapter covers recent techniques proposed for data curation and efficient metadata management in Hadoop. These techniques range from being semantics-specific, e.g., provenance tracking techniques, to generic frameworks for data curation and annotation.
Mohamed Y. Eltabakh
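To illustrate how data organization alone can speed up queries, the toy sketch below (not an actual Hadoop component) keeps per-block min/max statistics so a range query can skip blocks that cannot contain matches; the data is invented for illustration.

```python
# Lightweight per-block metadata (min/max statistics collected when the data
# is laid out) lets a query skip irrelevant blocks instead of scanning all.

def build_block_stats(blocks):
    # blocks: list of lists of (key, payload) records
    return [(min(k for k, _ in b), max(k for k, _ in b)) for b in blocks]

def range_query(blocks, stats, lo, hi):
    results = []
    for block, (bmin, bmax) in zip(blocks, stats):
        if bmax < lo or bmin > hi:      # block cannot contain matches: skip it
            continue
        results.extend(r for r in block if lo <= r[0] <= hi)
    return results

blocks = [[(i, f"rec{i}") for i in range(start, start + 100)]
          for start in (0, 100, 200, 300)]
stats = build_block_stats(blocks)
print(len(range_query(blocks, stats, 150, 160)))  # touches only one block
```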
Big Data Query Engines
Abstract
Big data analytics are techniques used to analyze large datasets in order to extract patterns, trends, correlations and summaries. Analytics are used in several big data applications, ranging from the generation of simple reports to running deep and complex query workloads. The insights drawn by running big data analytics depend primarily on the capabilities of the underlying query engine, which is responsible for translating user queries into efficient data retrieval and processing operations, as well as executing these operations on one or multiple nodes in order to find query answers. Classically, parallel database systems have been adopted in various domains, particularly enterprise data warehouses, as the data processing platform for running big data analytics. An SQL-based query engine, running on a shared-nothing cluster, is typically used by these platforms. Scalability is realized by partitioning data across multiple machines that communicate via a high-speed interconnect layer. These systems often rely on dedicated, expensive hardware resources in order to scale out query processing and provide fault tolerance. With the emergence of Hadoop, it became possible to use cheap commodity hardware for achieving linear scalability and fault tolerance. A typical Hadoop environment involves a software stack running in one ecosystem, while sharing hardware resources across different systems, called tenants. Earlier Hadoop query engines leveraged programming frameworks such as MapReduce to run analytics using programs executed on a distributed file system. The Hadoop Distributed File System (HDFS) has been effectively used for batch processing of simple analytics. The need for coding and manual optimization of analytics, the lack of support for complex queries, and the limited interactive processing capabilities have triggered the need for adopting new technologies with more expressive query languages and advanced query processing techniques. Integrating parallel database systems into the Hadoop ecosystem is an obvious approach to combining the advantages of both worlds. In this respect, multiple challenges needed to be addressed to fit a parallel database query engine into the Hadoop software stack. Data placement, query optimization, query execution and resource management are some of the technical problems actively studied in this area. In this chapter, we discuss the state of the art of query engines in parallel database systems, Hadoop-based systems, and hybrid systems that integrate parallel databases and Hadoop technologies. We present the architectures of multiple example systems and highlight their similarities and differences. We also give an overview of the research problems and proposed techniques in the areas of query optimization and execution.
Mohamed A. Soliman
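The partitioning that underpins shared-nothing query engines can be sketched in a few lines. The example below is a simplified, single-process illustration of hash-partitioning rows by key so that each worker can aggregate its partition independently; the column and worker counts are invented.

```python
from collections import defaultdict

def partition(rows, key, n_workers):
    # Route each row to a worker by a hash of the grouping/join key.
    parts = [[] for _ in range(n_workers)]
    for row in rows:
        parts[hash(row[key]) % n_workers].append(row)
    return parts

def local_count(part, key):
    counts = defaultdict(int)
    for row in part:
        counts[row[key]] += 1
    return dict(counts)

rows = [{"user": u} for u in ["a", "b", "a", "c", "b", "a"]]
partials = [local_count(p, "user") for p in partition(rows, "user", 2)]
# Because partitioning is by key, no per-key merge across workers is needed.
print(partials)
```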
Large-Scale Data Stream Processing Systems
Abstract
In our data-centric society, online services, decision making, and other aspects are increasingly becoming heavily dependent on trends and patterns extracted from data. A broad class of societal-scale data management problems requires system support for processing unbounded data with low latency and high throughput. Large-scale data stream processing systems perceive data as infinite streams and are designed to satisfy such requirements. They have further evolved substantially both in terms of expressive programming model support and also efficient and durable runtime execution on commodity clusters. Expressive programming models offer convenient ways to declare continuous data properties and applied computations, while hiding details on how these data streams are physically processed and orchestrated in a distributed environment. Execution engines provide a runtime for such models further allowing for scalable yet durable execution of any declared computation. In this chapter we introduce the major design aspects of large scale data stream processing systems, covering programming model abstraction levels and runtime concerns. We then present a detailed case study on stateful stream processing with Apache Flink, an open-source stream processor that is used for a wide variety of processing tasks. Finally, we address the main challenges of disruptive applications that large-scale data streaming enables from a systemic point of view.
Paris Carbone, Gábor E. Gévay, Gábor Hermann, Asterios Katsifodimos, Juan Soto, Volker Markl, Seif Haridi
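The sketch below illustrates, in plain Python rather than any stream processor's actual API, the core idea of stateful processing over an unbounded keyed stream with tumbling windows; the events are invented for illustration.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    # events: iterable of (timestamp, key), assumed roughly ordered by time.
    state = defaultdict(int)       # per-key state for the current window
    window_end = None
    for ts, key in events:
        if window_end is None:
            window_end = ts + window_size
        while ts >= window_end:    # close and emit the current window
            yield window_end, dict(state)
            state.clear()
            window_end += window_size
        state[key] += 1
    # The final, still-open window is never emitted: the stream is unbounded.

events = [(1, "a"), (2, "b"), (3, "a"), (11, "a"), (12, "b")]
for end, counts in tumbling_window_counts(events, window_size=10):
    print(end, counts)
```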

Semantic Big Data Management

Frontmatter
Semantic Data Integration
Abstract
The growing volume, variety and complexity of data being collected for scientific purposes presents challenges for data integration. For data to be truly useful, scientists need not only to be able to access it, but also be able to interpret and use it. Doing this requires semantic context. Semantic Data Integration is an active field of research, and this chapter describes the current challenges and how existing approaches are addressing them. The chapter then provides an overview of several active research areas within the semantic data integration field, including interactive and collaborative schema matching, integration of geospatial and biomedical data, and visualization of the data integration process. Finally, the need to move beyond the discovery of simple 1-to-1 equivalence matches to the identification of more complex relationships across datasets is presented and possible first steps in this direction are discussed.
Michelle Cheatham, Catia Pesquita
Linked Data Management
Abstract
The size of Linked Data is growing exponentially; thus, a Linked Data management system has to be able to deal with increasing amounts of data. Additionally, in the Linked Data context, variety is especially important. In spite of its seemingly simple data model, Linked Data actually encodes rich and complex graphs mixing both instance- and schema-level data. Since Linked Data is schema-free (i.e., the schema is not strict), standard database techniques cannot be directly adopted to manage it. Even though organizing Linked Data in the form of a table is possible, querying a giant triple table becomes very costly due to the multiple nested joins required by typical queries. The heterogeneity of Linked Data also poses entirely new challenges to database systems, where managing provenance information is becoming a requirement. Linked Data queries usually involve multiple sources, and results can be produced in various ways for a specific scenario. Such heterogeneous data can incorporate knowledge on provenance, which can be further leveraged to provide users with a reliable and understandable description of the way a query result was derived, and to improve query execution performance due to the high selectivity of provenance information. In this chapter, we provide a detailed overview of current approaches specifically designed for Linked Data management. We focus on storage models, indexing techniques, and query execution strategies. Finally, we provide an overview of provenance models, definitions, and serialization techniques for Linked Data. We also survey the database management systems implementing techniques to manage provenance information in the context of Linked Data.
Manfred Hauswirth, Marcin Wylot, Martin Grund, Paul Groth, Philippe Cudré-Mauroux
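The join-heavy nature of triple-table querying can be seen in a small sketch: even a two-pattern query over a toy triple table becomes a self-join on the shared variable. The data and predicate names below are invented for illustration.

```python
triples = [
    ("alice",   "worksFor",  "acme"),
    ("acme",    "locatedIn", "berlin"),
    ("bob",     "worksFor",  "initech"),
    ("initech", "locatedIn", "boston"),
]

def match(pattern):
    # pattern: (s, p, o), where None stands for a variable
    return [t for t in triples
            if all(q is None or q == v for q, v in zip(pattern, t))]

# ?person worksFor ?org . ?org locatedIn ?city  -> self-join on ?org
results = [(person, city)
           for (person, _, org) in match((None, "worksFor", None))
           for (org2, _, city) in match((None, "locatedIn", None))
           if org == org2]
print(results)   # [('alice', 'berlin'), ('bob', 'boston')]
```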
Non-native RDF Storage Engines
Abstract
The proliferation of heterogeneous Linked Data requires data management systems to constantly improve their scalability and efficiency. Linked Data can be stored according to many different data storage models. Some of these attempt to use general-purpose database storage techniques to persist Linked Data, and hence they can leverage existing data processing environments (e.g., big Hadoop clusters). We therefore look at the multiplicity of Linked Data storage systems, which we categorize into the following classes: relational database-based systems, NoSQL-based systems, and massively parallel systems.
Manfred Hauswirth, Marcin Wylot, Martin Grund, Sherif Sakr, Philippe Cudré-Mauroux
Exploratory Ad-Hoc Analytics for Big Data
Abstract
In a traditional relational database management system, queries can only be defined over attributes defined in the schema, but they are guaranteed to give a single, definitive answer structured exactly as specified in the query. In contrast, an information retrieval system allows the user to pose queries without knowledge of a schema, but the result will be a top-k list of possible answers, with no guarantees about the structure or content of the retrieved documents. In this chapter, we present Drill Beyond, a novel IR/RDBMS hybrid system, in which the user seamlessly queries a relational database together with a large corpus of tables extracted from a web crawl. The system allows full SQL queries over a relational database, but additionally enables the user to use arbitrary additional attributes in the query that need not be defined in the schema. The system then processes this semi-specified query by computing a top-k list of possible query evaluations, each based on different candidate web data sources, thus mixing properties of the two worlds of RDBMS and IR systems.
Julian Eberius, Maik Thiele, Wolfgang Lehner
Pattern Matching Over Linked Data Streams
Abstract
This chapter leverages semantic technologies, such as Linked Data, which can facilitate machine-to-machine (M2M) communications, to build an efficient information dissemination system for the semantic IoT. The system integrates Linked Data streams generated from various data collectors and disseminates matched data to relevant data consumers based on triple pattern queries registered in the system by the consumers. We also design two new data structures, TP-automata and CTP-automata, to meet the high performance needs of Linked Data dissemination. We evaluate our system using a real-world dataset generated from a Smart Building Project. With the new data structures, the proposed system can disseminate Linked Data faster than the existing approach with thousands of registered queries.
Yongrui Qin, Quan Z. Sheng
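The following simplified matcher (not the chapter's TP-automata or CTP-automata) conveys the basic dissemination idea: consumers register triple patterns, and each incoming triple is checked against the registered patterns via a predicate index. The pattern and consumer names are hypothetical.

```python
from collections import defaultdict

registry = defaultdict(list)   # predicate -> [(pattern, consumer)]

def register(pattern, consumer):
    # pattern: (s, p, o), where None stands for a variable; p must be bound here
    registry[pattern[1]].append((pattern, consumer))

def dispatch(triple):
    s, p, o = triple
    matched = []
    for (ps, pp, po), consumer in registry.get(p, []):
        if (ps is None or ps == s) and (po is None or po == o):
            matched.append(consumer)     # forward the triple to this consumer
    return matched

register((None, "hasTemperature", None), "hvac-dashboard")
register(("room-101", "hasTemperature", None), "room-101-alert")
print(dispatch(("room-101", "hasTemperature", "24.5")))
```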
Searching the Big Data: Practices and Experiences in Efficiently Querying Knowledge Bases
Abstract
Knowledge bases (KBs) are computer systems that store complex structured and unstructured facts, i.e., knowledge. KBs are described as open, shared databases of the world's knowledge and typically use the entity-relationship model. Most existing knowledge bases make their data available in the RDF format. Tools for querying, inferencing and reasoning over facts are developed to consume this knowledge. In this chapter, we introduce a client-side caching framework aimed at accelerating the overall query response speed. In particular, we improve a suboptimal graph edit distance function to estimate the similarity of SPARQL queries and develop an approach to transform SPARQL queries into feature vectors. Machine learning algorithms are applied to these feature vectors to identify similar queries that could potentially be the subsequent queries. We adapt multiple dimensionality reduction algorithms to reduce the identification time. We then prefetch and cache the results of these queries, aiming to improve the overall querying performance. We also develop a forecasting method, namely Modified Simple Exponential Smoothing, to implement cache replacement. Our approach has been evaluated using a very large set of real-world queries. The empirical results show that our approach has great potential to enhance the cache hit rate and accelerate the querying speed on SPARQL endpoints.
Wei Emma Zhang, Quan Z. Sheng
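For intuition, the sketch below applies textbook simple exponential smoothing (not the chapter's modified variant) to score cached query results, so that recently and frequently requested entries survive eviction; the hit sequences are invented.

```python
def ses(observations, alpha=0.3):
    # s_t = alpha * x_t + (1 - alpha) * s_{t-1}
    score = observations[0]
    for x in observations[1:]:
        score = alpha * x + (1 - alpha) * score
    return score

# 1 = the cached query was requested in that period, 0 = it was not
hot_query  = [1, 1, 0, 1, 1, 1]
cold_query = [1, 0, 0, 0, 0, 0]
print(ses(hot_query), ses(cold_query))  # evict the entry with the lower score
```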

Big Graph Analytics

Frontmatter
Management and Analysis of Big Graph Data: Current Systems and Open Challenges
Abstract
Many big data applications in business and science require the management and analysis of huge amounts of graph data. Suitable systems to manage and to analyze such graph data should meet a number of challenging requirements including support for an expressive graph data model with heterogeneous vertices and edges, powerful query and graph mining capabilities, ease of use as well as high performance and scalability. In this chapter, we survey current system approaches for management and analysis of "big graph data". We discuss graph database systems, distributed graph processing systems such as Google Pregel and its variations, and graph dataflow approaches based on Apache Spark and Flink. We further outline a recent research framework called Gradoop that is built on the so-called Extended Property Graph Data Model with dedicated support for analyzing not only single graphs but also collections of graphs. Finally, we discuss current and future research challenges.
Martin Junghanns, André Petermann, Martin Neumann, Erhard Rahm
Similarity Search in Large-Scale Graph Databases
Abstract
Graphs are ubiquitous and play an essential role in modeling and representing complex structures in real-world networked applications. Given a graph database that comprises a large collection of graphs, it is fundamental and critical to enable fast and flexible search for structurally similar graphs. In this chapter, we survey recent graph similarity search techniques and specifically focus on work based on the graph edit distance (GED) metric. State-of-the-art approaches for GED-based similarity search typically adopt a pruning-and-verification framework. They first take advantage of easy-to-compute lower bounds on graph edit distance and use novel graph indexing structures to efficiently evaluate such lower bounds between the graphs in the database and the query graph. This way, graphs that violate the GED lower-bound constraints can be identified and filtered out of the graph database, excluding them from further investigation. Then, the costly GED verification is performed only for the graphs that pass the GED lower-bound evaluation. We examine existing GED lower bounds, graph index structures, and similarity search algorithms in detail, and compare different similarity search methods from multiple aspects, including index construction cost, similarity search performance, and applicability in real-world graph databases. In the end, we envision and discuss future research directions related to similarity search and high-performance query processing in large-scale graph databases.
Peixiang Zhao
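A minimal example of the filtering step is shown below: a cheap label-based lower bound of the general flavor used in such systems (real engines use tighter bounds plus index structures) prunes graphs whose bound already exceeds the GED threshold. The graphs, labels, and threshold are invented for illustration.

```python
from collections import Counter

def label_lower_bound(g, q):
    # g, q: dicts holding the multisets of vertex labels ("v") and edge labels ("e").
    # Vertices/edges whose labels cannot be matched each require at least one edit.
    v_common = sum((Counter(g["v"]) & Counter(q["v"])).values())
    e_common = sum((Counter(g["e"]) & Counter(q["e"])).values())
    return (max(len(g["v"]), len(q["v"])) - v_common +
            max(len(g["e"]), len(q["e"])) - e_common)

query = {"v": ["C", "C", "O"], "e": ["single", "double"]}
database = [
    {"v": ["C", "C", "O"],      "e": ["single", "single"]},
    {"v": ["N", "N", "N", "N"], "e": ["single", "single", "single"]},
]
tau = 1
candidates = [g for g in database if label_lower_bound(g, query) <= tau]
print(len(candidates))   # only these candidates go on to exact GED verification
```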
Big-Graphs: Querying, Mining, and Beyond
Abstract
Graphs are a ubiquitous model to represent objects and their relations. However, the complex combinations of structure and content, coupled with massive volume, high streaming rate, and uncertainty inherent in the data, raise several challenges that require new efforts for smarter and faster graph analysis. With the advent of complex networks such as the World Wide Web, social networks, knowledge graphs, genome and scientific databases, Internet of things, medical and government records, novel graph computations are also emerging, including graph pattern matching and mining, similarity search, keyword search, and graph query-by-example. These workloads require both topology and content information of the network; and hence, they are different from classical graph computations such as shortest path, reachability, and minimum cut, which depend only on the structure of the network. In this chapter, we shall describe the emerging graph queries and mining problems, their applications and resolution techniques. We emphasize the current challenges and highlight some future research directions.
Arijit Khan, Sayan Ranu
Link and Graph Mining in the Big Data Era
Abstract
Graphs are a convenient representation for large sets of data, be they complex networks, social networks, publication networks, and so on. The growing volume of data modeled as complex networks, e.g., the World Wide Web and social networks like Twitter and Facebook, has given rise to a new area of research focused on complex network mining. In this new multidisciplinary area, it is possible to highlight some important tasks: extraction of statistical properties, community detection, and link prediction, among several others. This new approach has been driven largely by the growing availability of computers and communication networks, which allow us to gather and analyze data on a scale far larger than previously possible. In this chapter, we give an overview of several graph mining approaches for mining and handling large complex networks.
Ana Paula Appel, Luis G. Moyano
Granular Social Network: Model and Applications
Abstract
Social networks are becoming an integral part of modern society. Popular social network applications like Facebook and Twitter produce data at huge scale. These data show all the characteristics of Big Data. Accordingly, this leads to a deep change in the way social networks are analyzed. The chapter describes a model of social networks and its applications within the purview of information diffusion and community structure in network analysis. Here, fuzzy granulation theory is used to model uncertainties in social networks. This provides a new knowledge representation scheme for relational data by taking care of the indiscernibility among the actors as well as the fuzziness in their relations. Various network measures are defined on this new model. Within the context of this knowledge framework, algorithms for target set selection and community detection are developed. Here the target sets are determined using the new measure granular degree, whereas granular embeddedness, together with granular degree, is used for detecting various overlapping communities. The resulting community structures have a fuzzy-rough set theoretic description which allows a node to be a member of multiple communities with different memberships of association only if it falls in the (rough upper - rough lower) approximate region. A new index, called normalized fuzzy mutual information, is introduced which can be used to quantify the similarity between two fuzzy partition matrices, and hence the quality of the communities detected. Comparative studies demonstrating the superiority of the model over the graph-theoretic model are shown through extensive experimental results.
Sankar K. Pal, Suman Kundu

Big Data Applications

Frontmatter
Big Data, IoT and Semantics
Abstract
Big data and the Internet of Things are two parallel universes, but they are so close that in most cases they blend together. The number of devices that connect to the Internet grows day by day, and they produce enormous volumes of data. The IoT generates unprecedented amounts of data, and this impacts the entire big data universe. The IoT and big data are clearly growing apace, and are set to transform many areas of business and everyday life. Semantic technologies play a fundamental role in reducing incompatibilities among data formats and in providing an additional layer on which applications can be built to reason over data and extract new, meaningful information. In this chapter we report the most common approaches adopted in dealing with Big Data and IoT problems and explore some of the semantics-based solutions which address these problems.
Beniamino di Martino, Giuseppina Cretella, Antonio Esposito
SCADA Systems in the Cloud
Abstract
SCADA (Supervisory Control And Data Acquisition) systems allow users to monitor (using sensors) and control (using actuators) an industrial system remotely. Larger SCADA systems can support several hundred thousand sensors, sending and storing hundreds of thousands of messages per second and generating large amounts of data. As these systems are critical to industrial processes, they are often run on highly reliable and dedicated hardware. This is in contrast to the current state of computing, which is moving from running applications on internally hosted servers to cheaper, internal or external cloud environments. Clouds can benefit SCADA users by providing the storage and processing power to analyse the collected data. The goal of this chapter is twofold: to provide an introduction to techniques for migrating SCADA to clouds, and to devise a conceptual system which supports the process of migrating a SCADA application to a cloud resource while fulfilling key SCADA requirements (such as support for big data storage).
Philip Church, Harald Mueller, Caspar Ryan, Spyridon V. Gogouvitis, Andrzej Goscinski, Houssam Haitof, Zahir Tari
Quantitative Data Analysis in Finance
Abstract
Quantitative tools have been widely adopted in order to extract the massive amount of information contained in a variety of financial data. Mathematics, statistics and computer algorithms have never been so important to financial practitioners. Investment banks develop equilibrium models to evaluate financial instruments; mutual funds apply time series analysis to identify the risks in their portfolios; and hedge funds hope to extract market signals and statistical arbitrage from noisy market data. The rise of quantitative finance in the last decade relies on the development of computer techniques that make processing large datasets possible. As more data becomes available at higher frequency, more research in quantitative finance has shifted to the microstructure of financial markets. High frequency data is a typical example of big data characterized by the 3V's: velocity, variety and volume. In addition, the signal-to-noise ratio in financial time series is usually very small. High frequency datasets are more likely to be exposed to extreme values, jumps and errors than low frequency ones. Specific data processing techniques and quantitative models are elaborately designed to extract information from financial data efficiently. In this chapter, we present quantitative data analysis approaches in finance. First, we review the development of quantitative finance in the past decade. Then we discuss the characteristics of high frequency data and the challenges it brings. Quantitative data analysis consists of two basic steps: (i) data cleaning and aggregation; (ii) data modeling. We review the mathematical tools and computing technologies behind these two steps. The valuable information extracted from raw data is represented by a group of statistics. The most widely used statistics in finance are expected return and volatility, which are the fundamentals of modern portfolio theory. We further introduce some simple portfolio optimization strategies as an example of the application of financial data analysis. Big data has already changed the financial industry fundamentally, while quantitative tools for addressing massive financial data still have a long way to go. Adoption of advanced statistics, information theory, machine learning and faster computing algorithms is inevitable in order to predict complicated financial markets. These topics are briefly discussed in the later part of this chapter.
Xiang Shi, Peng Zhang, Samee U. Khan
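As a small worked example of the basic statistics mentioned above, the sketch below estimates mean return and volatility from a short, invented price series and annualizes them under the common 252-trading-day convention.

```python
import math

prices = [100.0, 101.2, 100.7, 102.3, 103.0, 102.1, 104.0]   # illustrative
returns = [prices[i] / prices[i - 1] - 1 for i in range(1, len(prices))]

mean_ret = sum(returns) / len(returns)                        # expected return
var = sum((r - mean_ret) ** 2 for r in returns) / (len(returns) - 1)
vol = math.sqrt(var)                                          # volatility

annual_return = mean_ret * 252
annual_vol = vol * math.sqrt(252)
print(f"daily mean {mean_ret:.4%}, daily vol {vol:.4%}")
print(f"annualized return {annual_return:.2%}, annualized vol {annual_vol:.2%}")
```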
Emerging Cost Effective Big Data Architectures
Abstract
The volume, velocity and variety of data are increasing at an unprecedented rate. There is a growing consensus that a single system cannot cater to this variety of workloads and real-world datasets, so different solutions are being researched and developed to cater to the requirements of different applications. For example, column stores are optimized specifically for data warehousing applications, whereas row stores are better suited for transactional workloads. There are also hybrid systems for applications that need support for both transactional workloads and data analytics. Other varied systems are being designed and built to store different types of data, such as document data stores for storing XML or JSON documents, and graph databases for graph-structured or RDF data. Most of these systems focus on minimizing execution time or improving performance and often ignore optimization of the overall cost of data management. A more holistic view of the cost of data management includes energy consumption and the utilization of compute, memory and storage resources, which contribute to the cost of data processing, especially in cloud-based pay-as-you-go environments. In this chapter, we discuss a new area of emerging Big Data architectures that aim at minimizing the overall cost of data storage, querying and analysis, while improving performance. We first provide a motivation for the overall problem, with appropriate related work. We then discuss the state of the art and provide key case studies of emerging cost-effective big data architectures that have been recently designed and built with the above-mentioned goals in mind. Finally, we enumerate key future directions and conclude.
K. Ashwin Kumar
Bringing High Performance Computing to Big Data Algorithms
Abstract
Many ideas from High Performance Computing are applicable to Big Data problems, all the more so now that hybrid GPU computing is gaining traction in mainstream computing applications. This work discusses the differences between the High Performance Computing software stack and the Big Data software stack, and then focuses on two popular computing workloads, the Alternating Least Squares algorithm and the Singular Value Decomposition, showing how their performance can be maximized using hybrid computing techniques.
H. Anzt, J. Dongarra, M. Gates, J. Kurzak, P. Luszczek, S. Tomov, I. Yamazaki
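For reference, a dense, single-node NumPy sketch of Alternating Least Squares is shown below; it omits missing-entry handling and GPU offload, which is precisely where the hybrid techniques discussed in the chapter come into play. All sizes and hyperparameters are illustrative.

```python
import numpy as np

def als(R, k=8, iters=20, reg=0.1):
    # Factor R (m x n) as U V^T by alternating regularized least-squares solves.
    m, n = R.shape
    rng = np.random.default_rng(0)
    U, V = rng.standard_normal((m, k)), rng.standard_normal((n, k))
    I = reg * np.eye(k)
    for _ in range(iters):
        U = np.linalg.solve(V.T @ V + I, V.T @ R.T).T   # fix V, solve for U
        V = np.linalg.solve(U.T @ U + I, U.T @ R).T     # fix U, solve for V
    return U, V

R = np.random.default_rng(1).random((50, 30))
U, V = als(R)
print("relative reconstruction error:",
      np.linalg.norm(R - U @ V.T) / np.linalg.norm(R))
```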
Cognitive Computing: Where Big Data Is Driving Us
Abstract
In this chapter we discuss the concepts and challenges involved in designing Cognitive Systems. Cognitive Computing is the use of computational learning systems to augment cognitive capabilities in solving real-world problems. Cognitive systems are designed to draw inferences from data and pursue the objectives they were given. The era of big data is the basis for innovative cognitive solutions that cannot rely on traditional systems. While traditional computers must be programmed by humans to perform specific tasks, cognitive systems will learn from their interactions with data and humans. Not only is Cognitive Computing a fundamentally new computing paradigm for tackling real-world problems, exploiting enormous amounts of data using massively parallel machines, but it also engenders a new form of interaction between humans and computers. As machines start to enhance human cognition and help people make better decisions, new research issues arise. We address the following questions for Cognitive Systems: What are the needs? Where should they be applied? Which sources of information should they rely on?
Ana Paula Appel, Heloisa Candello, Fábio Latuf Gandour
Privacy-Preserving Record Linkage for Big Data: Current Approaches and Research Challenges
Abstract
The growth of Big Data, especially personal data dispersed in multiple data sources, presents enormous opportunities and insights for businesses to explore and leverage the value of linked and integrated data. However, privacy concerns impede sharing or exchanging data for linkage across different organizations. Privacy-preserving record linkage (PPRL) aims to address this problem by identifying and linking records that correspond to the same real-world entity across several data sources held by different parties without revealing any sensitive information about these entities. PPRL is increasingly being required in many real-world application areas. Examples range from public health surveillance to crime and fraud detection, and national security. PPRL for Big Data poses several challenges, with the three major ones being (1) scalability to multiple large databases, due to their massive volume and the flow of data within Big Data applications, (2) achieving high quality results of the linkage in the presence of variety and veracity of Big Data, and (3) preserving privacy and confidentiality of the entities represented in Big Data collections. In this chapter, we describe the challenges of PPRL in the context of Big Data, survey existing techniques for PPRL, and provide directions for future research.
Dinusha Vatsalan, Ziad Sehili, Peter Christen, Erhard Rahm
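One widely used PPRL building block, sketched below purely for illustration (the chapter surveys many techniques, not necessarily this one), encodes the q-grams of a quasi-identifier into a Bloom filter and compares filters with the Dice coefficient, so similar values can be linked without being exchanged in the clear. Filter size, hash count, and the example names are invented.

```python
import hashlib

def qgrams(s, q=2):
    s = f"_{s.lower()}_"                       # pad so boundary characters count
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def bloom_encode(s, size=64, n_hashes=3):
    bits = [0] * size
    for gram in qgrams(s):
        for i in range(n_hashes):              # k independent hash positions
            h = hashlib.sha256(f"{i}:{gram}".encode()).hexdigest()
            bits[int(h, 16) % size] = 1
    return bits

def dice(a, b):
    common = sum(x & y for x, y in zip(a, b))
    return 2 * common / (sum(a) + sum(b))

print(dice(bloom_encode("christine"), bloom_encode("christina")))  # high
print(dice(bloom_encode("christine"), bloom_encode("bob")))        # low
```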
Metadata
Title
Handbook of Big Data Technologies
Editors
Albert Y. Zomaya
Sherif Sakr
Copyright Year
2017
Electronic ISBN
978-3-319-49340-4
Print ISBN
978-3-319-49339-8
DOI
https://doi.org/10.1007/978-3-319-49340-4
