
2015 | Book

Big-Data Analytics and Cloud Computing

Theory, Algorithms and Applications

Editors: Dr. Marcello Trovati, Prof. Richard Hill, Dr. Ashiq Anjum, Dr. Shao Ying Zhu, Prof. Lu Liu

Publisher: Springer International Publishing


About this book

This book reviews the theoretical concepts, leading-edge techniques and practical tools involved in the latest multi-disciplinary approaches addressing the challenges of big data. Illuminating perspectives from both academia and industry are presented by an international selection of experts in big data science.

Topics and features:
- Describes innovative advances in theoretical aspects of big data, predictive analytics and cloud-based architectures
- Examines the applications and implementations that utilize big data in cloud architectures
- Surveys the state of the art in architectural approaches to the provision of cloud-based big data analytics functions
- Identifies potential research directions and technologies to facilitate the realization of emerging business models through big data approaches
- Provides relevant theoretical frameworks, empirical research findings, and numerous case studies
- Discusses real-world applications of algorithms and techniques to address the challenges of big datasets

Table of Contents

Frontmatter

Theory

Frontmatter
Chapter 1. Data Quality Monitoring of Cloud Databases Based on Data Quality SLAs
Abstract
This chapter provides an overview of the tasks involved in the continuous process of monitoring the quality of cloud databases as their content is modified over time. In the Software as a Service context, this process must be guided by data quality service level agreements (DQ-SLAs), which specify customers' requirements for data quality monitoring. In practice, factors such as the scale of Big Data, the lack of data structure, strict service level agreement requirements, and the velocity of changes to the data pose many challenges to carrying out this process effectively. In this context, we present a high-level architecture of a cloud service that employs cloud computing capabilities to tackle these challenges, and discuss the technical and research problems that may be further explored to enable an effective deployment of the presented service.
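The core idea of SLA-guided monitoring can be illustrated with a minimal sketch: measured quality metrics are compared against thresholds stated in DQ-SLA clauses, and violations are reported. This is not the chapter's architecture; the `DQSLAClause` type and the single `completeness` metric are hypothetical illustrations.

```python
from dataclasses import dataclass

# Hypothetical DQ-SLA clause: a named quality metric must stay above a threshold.
@dataclass
class DQSLAClause:
    metric: str       # e.g. "completeness"
    threshold: float  # minimum acceptable score in [0, 1]

def completeness(records, required_fields):
    """Fraction of records in which every required field is present and non-empty."""
    if not records:
        return 1.0
    ok = sum(1 for r in records
             if all(r.get(f) not in (None, "") for f in required_fields))
    return ok / len(records)

def monitor(records, required_fields, clauses):
    """Evaluate each SLA clause against measured scores; return violations."""
    scores = {"completeness": completeness(records, required_fields)}
    return [(c.metric, scores[c.metric], c.threshold)
            for c in clauses
            if scores.get(c.metric, 1.0) < c.threshold]
```

In a real deployment this check would run continuously as the database content changes, with further metrics (freshness, consistency, deduplication) evaluated at the scale the chapter discusses.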
Dimas C. Nascimento, Carlos Eduardo Pires, Demetrio Mestre
Chapter 2. Role and Importance of Semantic Search in Big Data Governance
Abstract
Big Data promises to funnel masses of data into our information ecosystems, where it lets flourish a yet unseen variety of information, providing us with insights yet undreamed of. However, only if we are able to organize and arrange this deluge of variety into something meaningful to us can we expect new insights and thus benefit from Big Data. This chapter demonstrates that text analysis is essential for Big Data governance, but that it must reach beyond keyword analysis: we need a design of semantic search for Big Data. This design has to account for the individual nature of discovery and maintain a strong focus on the information consumer; in short, it has to support self-directed information discovery. Too many information discovery requests cannot be addressed by mainstream Big Data technologies. Such requests often concern less spectacular questions than those posed on a global scale, but ones that are essentially important to individual information consumers. We present an open discovery language (ODL) that can be completely controlled by information consumers. ODL is a Big Data technology that embraces the agile design of discovery from the information consumer's perspective. We want users to experiment with discovery and to adapt it to their individual needs.
Kurt Englmeier
Chapter 3. Multimedia Big Data: Content Analysis and Retrieval
Abstract
This chapter surveys recent developments in the area of multimedia big data, the biggest big data. One core problem is how best to process this multimedia big data in an efficient and scalable way. We outline examples of the use of the MapReduce framework, including Hadoop, which has become the most common approach to a truly scalable and efficient framework for common multimedia processing tasks, e.g., content analysis and retrieval. We also examine recent developments in deep learning, which has produced promising results in large-scale multimedia processing and retrieval. Overall, the focus is on empirical rather than theoretical studies, so as to highlight the most practically successful recent developments and the associated caveats and lessons learned.
Jer Hayes
Chapter 4. An Overview of Some Theoretical Topological Aspects of Big Data
Abstract
The growth of Big Data has forced traditional data science approaches to expand in order to address the multiple challenges associated with this field. Furthermore, the wealth of data available from a wide range of sources has fundamentally changed the requirements for theoretical methods that can provide insight into this field. This chapter presents a general overview of some theoretical aspects related to Big Data.
Marcello Trovati

Applications

Frontmatter
Chapter 5. Integrating Twitter Traffic Information with Kalman Filter Models for Public Transportation Vehicle Arrival Time Prediction
Abstract
Accurate bus arrival time prediction is key to improving the attractiveness of public transport, as it helps users better manage their travel schedules. This chapter proposes a model of bus arrival time prediction that aims to improve arrival time accuracy. The model functions as a preprocessing stage that handles real-world input data before further processing by a Kalman filter (KF) model; as such, it is able to overcome the data processing limitations of existing models and can improve the accuracy of the output information. The arrival time is predicted with a KF model, using information acquired from social network communication, in particular Twitter. The KF model predicts the arrival time by filtering out the noise or disturbances encountered during the journey. Twitter offers an API to retrieve live, real-time road traffic information, and semantic analysis is applied to the retrieved Twitter data. The processed Twitter data can then be treated as a new input for route calculations and updates, and are fed into the KF model for further processing to produce a new arrival time estimate.
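The filtering step at the heart of such a model can be sketched with a minimal scalar Kalman filter: a prior arrival-time estimate (e.g. from the timetable) is repeatedly blended with noisy observations (e.g. delay reports mined from tweets). This is a generic illustration under simple assumptions (identity state model, fixed noise variances), not the chapter's calibrated model.

```python
def kalman_update(x, P, z, R, Q=1.0):
    """One predict/update step of a scalar Kalman filter.

    x: current arrival-time estimate (minutes)
    P: variance of that estimate
    z: new noisy observation (e.g. a delay report derived from tweets)
    R: observation noise variance
    Q: process noise variance (uncertainty added per step)
    """
    # Predict: the state model here is identity, so only the variance grows.
    P = P + Q
    # Update: the Kalman gain K weighs observation against prediction.
    K = P / (P + R)
    x = x + K * (z - x)
    P = (1 - K) * P
    return x, P

# Start from a timetable estimate of 12 min, then fold in two observations.
x, P = 12.0, 4.0
for z in [15.0, 14.0]:
    x, P = kalman_update(x, P, z, R=2.0)
```

Each update pulls the estimate toward the observations while shrinking its variance, which is how the KF "filters the noise" accumulated during the journey.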
Ahmad Faisal Abidin, Mario Kolberg, Amir Hussain
Chapter 6. Data Science and Big Data Analytics at CareerBuilder
Abstract
In the online job recruitment domain, matching job seekers with relevant jobs is critical for closing the skills gap. When dealing with millions of resumes and job postings, such matching analytics involve several Big Data challenges. At CareerBuilder, we tackle these challenges by (i) classifying large datasets of job ads and job seeker resumes to occupation categories and (ii) providing a scalable framework that facilitates executing web services for Big Data applications.
In this chapter, we discuss two systems currently in production at CareerBuilder that facilitate our goal of closing the skills gap. These systems also power several downstream applications and labor market analytics products. We first discuss Carotene, a large-scale, machine learning-based semi-supervised job title classification system. Carotene has a coarse- and fine-grained cascade architecture and a clustering-based job title taxonomy discovery component that facilitates discovering more fine-grained job titles than those in the industry-standard occupation taxonomy. We then describe CARBi, a system for developing and deploying Big Data applications for understanding and improving job-resume dynamics. CARBi consists of two components: (i) WebScalding, a library that provides quick access to commonly used datasets, database tables, data formats, web services, and helper functions to access and transform data, and (ii) ScriptDB, a standalone application that helps developers execute and manage Big Data projects. The system is built in such a way that every job developed using CARBi can be executed in both local and cluster modes.
Faizan Javed, Ferosh Jacob
Chapter 7. Extraction of Bayesian Networks from Large Unstructured Datasets
Abstract
Bayesian networks (BNs) provide a useful modelling tool with wide applicability across a variety of research and business areas. However, their construction is very time-consuming when carried out manually. In this chapter, we discuss an automated method to identify, assess and aggregate relevant information from large unstructured datasets to build fragments of BNs.
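One simple way to see how BN fragments might be seeded from unstructured text is to propose candidate edges between concepts that frequently co-occur. The sketch below is a hypothetical illustration of that idea, not the chapter's method: it scores a directed candidate edge a → b by the conditional frequency of b appearing in sentences that mention a.

```python
from collections import Counter
from itertools import combinations

def candidate_edges(sentences, concepts, min_conf=0.5):
    """Propose BN edge candidates a -> b when the estimated
    P(b mentioned | a mentioned) over sentences exceeds min_conf."""
    occurs = Counter()  # sentences mentioning each concept
    pairs = Counter()   # sentences mentioning both of an ordered pair
    for s in sentences:
        present = {c for c in concepts if c in s.lower()}
        for c in present:
            occurs[c] += 1
        for a, b in combinations(sorted(present), 2):
            pairs[(a, b)] += 1
            pairs[(b, a)] += 1
    return [(a, b, pairs[(a, b)] / occurs[a])
            for (a, b) in pairs
            if pairs[(a, b)] / occurs[a] >= min_conf]
```

In practice such candidates would only be a starting point: the extracted fragments still need assessment and aggregation, since co-occurrence alone does not establish direction or causality.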
Marcello Trovati
Chapter 8. Two Case Studies Based on Large Unstructured Sets
Abstract
In this chapter, we present two case studies based on large unstructured datasets. The first considers the Patient Health Questionnaire (PHQ-9), the most common depression assessment tool, which suggests the severity and type of depression an individual may be suffering from. In particular, we assess a method that appears to enhance the current system available to health professionals when diagnosing depression, based on a combination of a computational assessment method and a mathematical ranking system defined from a large unstructured dataset of abstracts available from PubMed. The second case study concerns a probabilistic extraction method introduced in Trovati et al. (IEEE Trans ADD, 2015, submitted). We consider three different datasets introduced in Trovati et al. (IEEE Trans ADD, 2015, submitted; Extraction, identification and ranking of network structures from data sets. In: Proceedings of CISIS, Birmingham, pp 331–337, 2014) and Trovati (Int J Distrib Syst Technol, 2015, in press), whose results clearly indicate the reliability and efficiency of this type of approach when addressing large unstructured datasets. This is part of ongoing research aiming to provide a tool to extract, assess and visualise intelligence from large unstructured datasets.
Aaron Johnson, Paul Holmes, Lewis Craske, Marcello Trovati, Nik Bessis, Peter Larcombe
Chapter 9. Information Extraction from Unstructured Data Sets: An Application to Cardiac Arrhythmia Detection
Abstract
In this chapter, we discuss a case study in which fuzzy partition rules are semi-automatically defined to provide a powerful and accurate insight into cardiac arrhythmia. In particular, this is based on large unstructured datasets in the form of scientific papers focusing on cardiology. The extracted information is subsequently combined with expert knowledge, as well as experimental data, to provide a robust, scalable and accurate system. The evaluation shows a high accuracy rate of 92.6%, as well as transparency of the system, which is a remarkable improvement with respect to current research in the field.
Omar Behadada
Chapter 10. A Platform for Analytics on Social Networks Derived from Organisational Calendar Data
Abstract
In this chapter, we present a social network analytics platform built on a NoSQL graph datastore. The platform was developed to facilitate communication, the management of interactions and the boosting of social capital in large organisations. As with most social software, our platform requires data at large scale to be consumed, processed and exploited for the generation of its automated social networks. The platform's purpose is to reduce the cost and effort of managing and maintaining communication strategies within an organisation, through analytics performed on social networks generated from existing data. The chapter focuses on the process of acquiring redundant calendar data, available to all organisations, and processing it into a social network that can be analysed.
Dominic Davies-Tagg, Ashiq Anjum, Richard Hill
Backmatter
Metadata
Title
Big-Data Analytics and Cloud Computing
Editors
Dr. Marcello Trovati
Prof. Richard Hill
Dr. Ashiq Anjum
Dr. Shao Ying Zhu
Prof. Lu Liu
Copyright Year
2015
Electronic ISBN
978-3-319-25313-8
Print ISBN
978-3-319-25311-4
DOI
https://doi.org/10.1007/978-3-319-25313-8
