
2016 | Book

Big Data Technologies and Applications


About this book

The objective of this book is to introduce the basic concepts of big data computing and then to describe the total solution to big data problems using HPCC, an open-source computing platform.
The book comprises 15 chapters organized into three parts. The first part, Big Data Technologies, includes introductions to big data concepts and techniques; big data analytics; and visualization and learning techniques. The second part, LexisNexis Risk Solution to Big Data, focuses on specific technologies and techniques developed at LexisNexis to solve critical problems using big data analytics. It covers the open source High Performance Computing Cluster (HPCC Systems®) platform and its architecture, as well as the parallel data languages ECL and KEL, developed to effectively solve big data problems. The third part, Big Data Applications, describes various data-intensive applications solved on HPCC Systems. It includes applications such as cyber security, social network analytics including fraud, Ebola spread modeling using big data analytics, unsupervised learning, and image classification.
The book is intended for a wide variety of people including researchers, scientists, programmers, engineers, designers, developers, educators, and students. This book can also be beneficial for business managers, entrepreneurs, and investors.



Table of Contents

Frontmatter

Big Data Technologies

Frontmatter
Chapter 1. Introduction to Big Data
Abstract
In this chapter we present the basic terms and concepts in Big Data computing. Big data is a large and complex collection of data sets, which is difficult to process using on-hand database management tools and traditional data processing applications. The chapter surveys the main activities involved in Big Data computing.
Borko Furht, Flavio Villanustre
Chapter 2. Big Data Analytics
Abstract
The age of big data has arrived, but traditional data analytics may not be able to handle such large quantities of data. The questions that now arise are how to develop a high-performance platform to efficiently analyze big data and how to design appropriate mining algorithms to extract useful knowledge from it. To examine this issue in depth, this chapter begins with a brief introduction to data analytics, followed by a discussion of big data analytics. Some important open issues and further research directions are also presented for the next steps in big data analytics.
Chun-Wei Tsai, Chin-Feng Lai, Han-Chieh Chao, Athanasios V. Vasilakos
Chapter 3. Transfer Learning Techniques
Abstract
Machine learning and data mining techniques have been used in numerous real-world applications. An assumption of traditional machine learning methodologies is that the training data and testing data are taken from the same domain, such that the input feature space and data distribution characteristics are the same. However, in some real-world machine learning scenarios, this assumption does not hold. There are cases where training data is expensive or difficult to collect. Therefore, there is a need to create high-performance learners trained with more easily obtained data from different domains. This methodology is referred to as transfer learning. This survey formally defines transfer learning, presents information on current solutions, and reviews applications of transfer learning. Lastly, it lists software downloads for various transfer learning solutions and discusses possible future research work. The transfer learning solutions surveyed are independent of data size and can be applied to big data environments.
Karl Weiss, Taghi M. Khoshgoftaar, DingDing Wang
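For reference, the formal definition this survey builds on follows Pan and Yang's widely used formulation; the notation below is a sketch, not a quotation from the chapter. A domain \(\mathcal{D} = \{\mathcal{X}, P(X)\}\) consists of a feature space and a marginal distribution, and a task \(\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}\) consists of a label space and a predictive function. Transfer learning then means:

\[
\text{given } (\mathcal{D}_S, \mathcal{T}_S) \text{ and } (\mathcal{D}_T, \mathcal{T}_T) \text{ with } \mathcal{D}_S \neq \mathcal{D}_T \text{ or } \mathcal{T}_S \neq \mathcal{T}_T,\ \text{improve } f_T(\cdot) \text{ using knowledge from } (\mathcal{D}_S, \mathcal{T}_S).
\]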
Chapter 4. Visualizing Big Data
Abstract
This chapter provides a multi-disciplinary overview of the research issues and achievements in the field of Big Data and its visualization techniques and tools. The main aim is to summarize the challenges in visualization methods for existing Big Data, as well as to offer novel solutions for issues related to the current state of Big Data visualization. The chapter provides a classification of existing data types, analytical methods, visualization techniques, and tools, with particular emphasis on surveying the evolution of visualization methodology in recent years. Based on the results, we reveal the disadvantages of existing visualization methods. Despite the technological development of the modern world, human involvement (interaction), judgment, and logical thinking remain necessary while working with Big Data. Therefore, the role of human perceptual limitations when handling large amounts of information is evaluated. Based on the results, a non-traditional approach is proposed: we discuss how the capabilities of Augmented Reality and Virtual Reality could be applied to the field of Big Data visualization. We discuss the promising utility of integrating Mixed Reality technology with Big Data visualization applications: placing the most essential data in the central area of the human visual field in Mixed Reality would allow one to absorb the presented information in a short period of time without significant data loss due to human perceptual issues. Furthermore, we discuss the impact of new technologies, such as Virtual Reality displays and Augmented Reality helmets, on Big Data visualization, and classify the main challenges of integrating these technologies.
Ekaterina Olshannikova, Aleksandr Ometov, Yevgeni Koucheryavy, Thomas Olsson
Chapter 5. Deep Learning Techniques in Big Data Analytics
Abstract
Big Data Analytics and Deep Learning are two high-focus areas of data science. Big Data has become important as many organizations, both public and private, have been collecting massive amounts of domain-specific information, which can contain useful information about problems such as national intelligence, cyber security, fraud detection, marketing, and medical informatics. Companies such as Google and Microsoft are analyzing large volumes of data for business analysis and decisions, impacting existing and future technology. Deep Learning algorithms extract high-level, complex abstractions as data representations through a hierarchical learning process: complex abstractions are learned at a given level based on relatively simpler abstractions formulated in the preceding level of the hierarchy. A key benefit of Deep Learning is the analysis and learning of massive amounts of unsupervised data, making it a valuable tool for Big Data Analytics, where raw data is largely unlabeled and uncategorized. In the present study, we explore how Deep Learning can be utilized to address some important problems in Big Data Analytics, including extracting complex patterns from massive volumes of data, semantic indexing, data tagging, fast information retrieval, and simplifying discriminative tasks. We also investigate some aspects of Deep Learning research that need further exploration to incorporate specific challenges introduced by Big Data Analytics, including streaming data, high-dimensional data, scalability of models, and distributed computing. We conclude by presenting insights into relevant future work, posing questions on defining data sampling criteria, domain adaptation modeling, criteria for obtaining useful data abstractions, improving semantic indexing, semi-supervised learning, and active learning.
Maryam M. Najafabadi, Flavio Villanustre, Taghi M. Khoshgoftaar, Naeem Seliya, Randall Wald, Edin Muharemagic
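The hierarchical learning process described above can be summarized with a generic layer-wise recursion (a schematic sketch; the chapter discusses several concrete architectures):

\[
h^{(0)} = x, \qquad h^{(l)} = \sigma\!\big(W^{(l)} h^{(l-1)} + b^{(l)}\big), \quad l = 1, \dots, L,
\]

where \(\sigma\) is a nonlinearity and the weights \(W^{(l)}\) are learned, so each level encodes increasingly abstract representations of the raw input \(x\).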

LexisNexis Risk Solution to Big Data

Frontmatter
Chapter 6. The HPCC/ECL Platform for Big Data
Abstract
As a result of the continuing information explosion, many organizations are experiencing what is now called the "Big Data" problem: the inability of organizations to effectively use massive amounts of their data, held in datasets that have grown too big to process in a timely manner. Data-intensive computing represents a new computing paradigm [26] which can address the big data problem using high-performance architectures supporting scalable parallel processing, allowing government, commercial organizations, and research environments to process massive amounts of data and implement new applications previously thought to be impractical or infeasible.
Anthony M. Middleton, David Alan Bayliss, Gavin Halliday, Arjuna Chala, Borko Furht
Chapter 7. Scalable Automated Linking Technology for Big Data Computing
Abstract
The massive amount of data being collected at many organizations has led to what is now being called the “Big Data” problem, which limits the capability of organizations to process and use their data effectively and makes the record linkage process even more challenging [3, 13]. New high-performance data-intensive computing architectures supporting scalable parallel processing such as Hadoop MapReduce and HPCC allow government, commercial organizations, and research environments to process massive amounts of data and solve complex data processing problems including record linkage.
Anthony M. Middleton, David Bayliss, Bob Foreman
Chapter 8. Aggregated Data Analysis in HPCC Systems
Abstract
The HPCC (High Performance Computing Cluster) architecture is driven by a proprietary data processing language: Enterprise Control Language (ECL). At first glance, the proprietary nature of ECL may be perceived as a disadvantage compared to a widespread query language such as SQL.
David Bayliss
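To make the contrast with SQL concrete, here is a minimal ECL sketch of the kind of declarative aggregation the chapter analyzes; the logical file path and field names are hypothetical:

// Record layout and logical file name are illustrative only.
PersonRec := RECORD
  STRING20 firstName;
  STRING20 lastName;
  STRING2  state;
END;

persons := DATASET('~example::persons', PersonRec, THOR);

// Cross-tab: count persons per state, analogous to SQL's
//   SELECT state, COUNT(*) FROM persons GROUP BY state;
byState := TABLE(persons, {state, UNSIGNED cnt := COUNT(GROUP)}, state);

OUTPUT(SORT(byState, -cnt));

Because ECL is declarative, the platform's optimizer decides how to distribute and execute the aggregation across the cluster.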
Chapter 9. Models for Big Data
Abstract
The principal performance driver of a Big Data application is the data model in which the Big Data resides. Unfortunately, most extant Big Data tools impose a data model upon a problem and thereby cripple their performance in some applications. The aim of this chapter is to discuss some of the principal data models that exist and are imposed, and then to argue that an industrial-strength Big Data solution needs to be able to move between these models with a minimum of effort.
David Bayliss
Chapter 10. Data Intensive Supercomputing Solutions
Abstract
As a result of the continuing information explosion, many organizations are drowning in data, and the resulting "data gap," the inability to process this information and use it effectively, is growing at an alarming rate. Data-intensive computing represents a new computing paradigm which can address the data gap using scalable parallel processing, allowing government, commercial organizations, and research environments to process massive amounts of data and implement applications previously thought to be impractical or infeasible.
Anthony M. Middleton
Chapter 11. Graph Processing with Massive Datasets: A KEL Primer
Abstract
Graph theory and the study of networks can be traced back to Leonhard Euler's original paper on the Seven Bridges of Königsberg in 1736 [1]. Although the mathematical foundations for understanding graphs have been laid out over the last few centuries [2–4], it wasn't until recently, with the advent of modern computers, that parsing and analysis of large-scale graphs became tractable [5]. In the last decade, graph theory gained mainstream popularity following the adoption of graph models for new application domains, including social networks and the web of data, both of which generate extremely large and dynamic graphs that cannot be adequately handled by legacy graph management applications [6].
David Bayliss, Flavio Villanustre

Big Data Applications

Frontmatter
Chapter 12. HPCC Systems for Cyber Security Analytics
Abstract
Many of the most daunting challenges in today's cyber security world stem from a constant and overwhelming flow of raw network data. The volume, variety, and velocity at which this raw data is created and transmitted across networks is staggering; so staggering, in fact, that the vast majority of data is typically regarded as background noise, often discarded or ignored, and thus stripped of the immense potential value that could be realized through proper analysis. When an organization is capable of comprehending this data in its totality—whether it originates from firewall logs, IDS alerts, server event logs, or other sources—then it can begin to identify and trace the markers, clues, and clusters of activity that represent threatening behavior.
Flavio Villanustre, Mauricio Renzi
Chapter 13. Social Network Analytics: Hidden and Complex Fraud Schemes
Abstract
In this chapter we briefly describe several case studies of using HPCC Systems in social network analytics.
Flavio Villanustre, Borko Furht
Chapter 14. Modeling Ebola Spread Using the HPCC/KEL System
Abstract
Epidemics have disturbed human lives for centuries, causing massive numbers of deaths and illnesses among people and animals. Due to increasing urbanization, the possibility of a worldwide epidemic is growing as well. Infectious diseases like Ebola remain among the world's leading causes of mortality and years of life lost. Addressing these significant disease burdens, which mostly impact the world's poorest regions, is a huge challenge that requires new solutions and new technologies. This chapter describes some of the models and mobile applications that can be used in determining transmission, predicting outbreaks, and preventing an Ebola epidemic.
Jesse Shaw, Flavio Villanustre, Borko Furht, Ankur Agarwal, Abhishek Jain
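For orientation, compartmental epidemic models of the kind referenced here typically build on the classic SIR system (a canonical sketch; the chapter's Ebola-specific models may differ):

\[
\frac{dS}{dt} = -\beta \frac{SI}{N}, \qquad \frac{dI}{dt} = \beta \frac{SI}{N} - \gamma I, \qquad \frac{dR}{dt} = \gamma I,
\]

where \(S\), \(I\), and \(R\) are the susceptible, infected, and recovered populations, \(N = S + I + R\), \(\beta\) is the transmission rate, and \(\gamma\) is the recovery rate. The basic reproduction number \(R_0 = \beta/\gamma\) determines whether an outbreak grows or dies out.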
Chapter 15. Unsupervised Learning and Image Classification in High Performance Computing Cluster
Abstract
Feature learning and object classification in machine learning are ongoing research areas. Identifying good features has various benefits for object classification, such as reducing the computational cost and increasing the classification accuracy. In this study, we implement a new multimodal feature learning method and object identification framework using the High Performance Computing Cluster (HPCC Systems®) platform. The framework first learns representative weights over unlabeled data for each modality through the K-means unsupervised learning method. Then, the desired features are extracted from the labeled data using the correlation between the labeled data and the representative bases. These labeled features are fused and fed to the classifiers to make the final recognition. HPCC Systems® is a Big Data processing and massively parallel processing (MPP) computing platform used for solving Big Data problems. Algorithms are implemented in HPCC Systems® in a language called Enterprise Control Language (ECL), which is a declarative, data-centric programming language. It is a powerful, high-level, parallel programming language ideal for data-intensive Big Data applications. The proposed framework is evaluated using various databases, such as CALTECH-101, the AR database, and a subset of the wild PubFig83 data, to which multimedia content is added. For instance, the classification accuracy result of [1] is improved from 74.3 to 78.9 % on the AR database using the Decision Tree C4.5 classifier.
I. Itauma, M. S. Aslan, X. W. Chen, Flavio Villanustre
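The feature-learning step described above can be written schematically (notation is ours, not the chapter's): K-means over the unlabeled data of each modality yields centroids that act as representative bases,

\[
\{b_1, \dots, b_K\} = \operatorname*{arg\,min}_{b_1,\dots,b_K} \sum_{x \in X_{\text{unlabeled}}} \min_{k} \lVert x - b_k \rVert^2,
\]

and each labeled sample \(x\) is then encoded by its correlation with every basis, \(f_k(x) = \operatorname{corr}(x, b_k)\) for \(k = 1, \dots, K\); the per-modality feature vectors are fused (concatenated) and passed to the classifiers.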
Metadata
Title
Big Data Technologies and Applications
Written by
Borko Furht
Flavio Villanustre
Copyright Year
2016
Electronic ISBN
978-3-319-44550-2
Print ISBN
978-3-319-44548-9
DOI
https://doi.org/10.1007/978-3-319-44550-2