2015 | Book

Guide to High Performance Distributed Computing

Case Studies with Hadoop, Scalding and Spark

About this book

This timely text/reference describes the development and implementation of large-scale distributed processing systems using open source tools and technologies. Comprehensive in scope, the book presents state-of-the-art material on building high performance distributed computing systems, providing practical guidance and best practices as well as describing theoretical software frameworks. Features: describes the fundamentals of building scalable software systems for large-scale data processing in the new paradigm of high performance distributed computing; presents an overview of the Hadoop ecosystem, followed by step-by-step instructions on its installation, programming and execution; reviews the basics of Spark, including resilient distributed datasets, and examines Hadoop streaming and working with Scalding; provides detailed case studies on approaches to clustering, data classification and regression analysis; explains the process of creating a working recommender system using Scalding and Spark.

Table of Contents

Frontmatter

Programming Fundamentals of High Performance Distributed Computing

Frontmatter
Chapter 1. Introduction
Abstract
This chapter introduces distributed computing, focusing on a range of foundational ideas and topics. It identifies several properties of distributed systems in general and briefly discusses the different types of systems, shedding light on popular architectures used in successful distributed system deployments. It further identifies several challenges and hints at open research areas. The chapter ends with trends and examples where distributed systems have contributed immensely.
K. G. Srinivasa, Anil Kumar Muppalla
Chapter 2. Getting Started with Hadoop
Abstract
Apache Hadoop is a software framework that allows distributed processing of large datasets across clusters of computers using simple programming models. It is designed to scale up from a single server to thousands of nodes, and to detect failures at the application level rather than rely on hardware for high availability, thereby delivering a highly available service on top of a cluster of commodity hardware nodes, each of which is prone to failure [2]. While Hadoop can be run on a single machine, its true power is realized in its ability to scale up to thousands of computers, each with several processor cores, and to distribute large amounts of work across the cluster efficiently [1].
K. G. Srinivasa, Anil Kumar Muppalla
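As a rough illustration of the "simple programming models" this chapter refers to, below is a minimal Hadoop MapReduce word count written in Scala against the Hadoop Java API. It is a sketch only: the class names, input/output paths and whitespace tokenization are illustrative assumptions, not code from the book.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import scala.collection.JavaConverters._

// Map phase: emit (word, 1) for every whitespace-separated token in a line.
class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      ctx.write(word, one)
    }
}

// Reduce phase: sum the counts emitted for each word.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    val sum = values.asScala.foldLeft(0)(_ + _.get)
    ctx.write(key, new IntWritable(sum))
  }
}

// Driver: wires the mapper and reducer into a job and submits it to the cluster.
object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(classOf[TokenMapper])
    job.setMapperClass(classOf[TokenMapper])
    job.setCombinerClass(classOf[SumReducer])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))    // input directory on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args(1)))  // must not already exist
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```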
Chapter 3. Getting Started with Spark
Abstract
Cluster computing has seen a rise of improved and popular computing models in which clusters execute data-parallel computations on unreliable machines, enabled by software systems that provide locality-aware scheduling, fault tolerance, and load balancing. MapReduce [1] pioneered this model, while systems like Map-Reduce-Merge [2] and Dryad [3] have generalized it to different data flow types. These systems are scalable and fault tolerant because they provide a programming model in which users create acyclic data flow graphs that pass input data through a set of operations. This model enables the system to schedule work and react to faults without any user intervention. While the model applies to many applications, there are problems that cannot be solved efficiently by acyclic data flows.
K. G. Srinivasa, Anil Kumar Muppalla
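To make the contrast concrete, here is a minimal Spark word count in Scala; the cache() call hints at why Spark's in-memory RDDs suit iterative and interactive workloads that a purely acyclic data flow would recompute. The input/output paths and application name are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkWordCount")
    val sc = new SparkContext(conf)

    // An RDD is a fault-tolerant, partitioned collection; its lineage lets Spark
    // recompute lost partitions instead of relying on replication alone.
    val counts = sc.textFile(args(0))
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(w => (w, 1))
      .reduceByKey(_ + _)
      .cache() // keep the result in memory for reuse across actions

    // Reusing `counts` in two actions illustrates the benefit of caching.
    println(s"distinct words: ${counts.count()}")
    counts.saveAsTextFile(args(1))
    sc.stop()
  }
}
```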
Chapter 4. Programming Internals of Scalding and Spark
Abstract
Scalding is a Scala-based library built on top of Cascading, a Java library that forms an abstraction over the low-level Hadoop API. It is comparable to Pig, but brings the advantages of Scala to building MapReduce jobs [1].
K. G. Srinivasa, Anil Kumar Muppalla
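For a flavour of the Scalding API this chapter covers, below is the classic word-count job in Scalding's fields-based API; the --input/--output arguments and the whitespace tokenization are illustrative assumptions.

```scala
import com.twitter.scalding._

// Reads lines of text, splits them into words, and writes (word, count) pairs.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.split("""\s+""") }
    .groupBy('word) { _.size }   // group identical words and count them
    .write(Tsv(args("output")))
}
```

In practice a job like this is typically packaged into an assembly jar and launched on a cluster through com.twitter.scalding.Tool, passing --hdfs along with the --input and --output paths.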

Case Studies Using Hadoop, Scalding and Spark

Frontmatter
Chapter 5. Case Study I: Data Clustering using Scalding and Spark
Abstract
Data mining is the process of discovering insightful, interesting, and novel patterns, as well as descriptive, understandable, and predictive models from large-scale data.
K. G. Srinivasa, Anil Kumar Muppalla
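As a hedged sketch of the kind of clustering pipeline such a case study might build, the following uses Spark MLlib's k-means in Scala; the input format (whitespace-separated numeric features per line), the number of clusters and the iteration count are assumptions, not the book's own code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeansSketch"))

    // Assumed input: each line holds space-separated numeric features.
    val points = sc.textFile(args(0))
      .map(line => Vectors.dense(line.split("\\s+").map(_.toDouble)))
      .cache()

    val k = 3          // number of clusters (assumption)
    val iterations = 20
    val model = KMeans.train(points, k, iterations)

    model.clusterCenters.foreach(c => println(s"cluster center: $c"))
    println(s"within-set sum of squared errors: ${model.computeCost(points)}")
    sc.stop()
  }
}
```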
Chapter 6. Case Study II: Data Classification using Scalding and Spark
Abstract
It is important to characterize learning problems by the type of data they use; knowledge about the data matters because similar learning techniques can be applied to similar data types. For example, natural language processing and bioinformatics use very similar tools for strings, whether natural language text or DNA sequences. The most basic type of data entity is the vector. For example, an insurance corporation may want a vector of patient details such as blood pressure, heart rate, height, weight, cholesterol, smoking status, and gender to infer the patient's life expectancy. A farmer might be interested in determining the ripeness of fruit based on a vector of size, weight, and spectral data. An electrical engineer may want to find the dependency between voltage and current. A search engine might want a vector of counts describing the frequency of words.
K. G. Srinivasa, Anil Kumar Muppalla
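To illustrate learning from feature vectors like those described above, here is a small classification sketch using Spark MLlib's naive Bayes in Scala; the "label,features" input format, the train/test split and the choice of model are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object ClassifierSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ClassifierSketch"))

    // Assumed input: lines of the form "label,f1 f2 f3 ..." with nonnegative features.
    val data = sc.textFile(args(0)).map { line =>
      val Array(label, features) = line.split(",", 2)
      LabeledPoint(label.toDouble,
        Vectors.dense(features.trim.split("\\s+").map(_.toDouble)))
    }

    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)
    val model = NaiveBayes.train(train)

    // Fraction of held-out examples the model labels correctly.
    val accuracy = test
      .map(p => if (model.predict(p.features) == p.label) 1.0 else 0.0)
      .mean()
    println(s"test accuracy: $accuracy")
    sc.stop()
  }
}
```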
Chapter 7. Case Study III: Regression Analysis using Scalding and Spark
Abstract
Regression analysis is usually applied to prediction and forecasting, with substantial overlap with the field of machine learning. Regression determines the relationship between the dependent and independent variables and is used to explore different forms of these relationships. In certain circumstances, where the assumptions are suitably restricted, regression helps to infer a causal relationship; however, caution is advised, as this can lead to illusory conclusions.
K. G. Srinivasa, Anil Kumar Muppalla
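A minimal regression sketch with Spark MLlib's LinearRegressionWithSGD in Scala is shown below; the "y,x1 x2 ..." input format and the training parameters (iteration count, step size) are illustrative assumptions, not the book's own case study code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

object RegressionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RegressionSketch"))

    // Assumed input: lines of the form "y,x1 x2 x3 ...".
    val data = sc.textFile(args(0)).map { line =>
      val Array(y, xs) = line.split(",", 2)
      LabeledPoint(y.toDouble,
        Vectors.dense(xs.trim.split("\\s+").map(_.toDouble)))
    }.cache()

    val numIterations = 100
    val stepSize = 0.01
    val model = LinearRegressionWithSGD.train(data, numIterations, stepSize)

    // Mean squared error of the fitted model on the training data.
    val mse = data.map { p =>
      val err = model.predict(p.features) - p.label
      err * err
    }.mean()
    println(s"training MSE: $mse")
    sc.stop()
  }
}
```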
Chapter 8. Case Study IV: Recommender System Using Scalding and Spark
Abstract
Recommender systems are software tools used to suggest items of use to users based on certain assumptions [1, 2]. The item here refers to an entity that the system recommends to its users; accordingly, the recommender system's design, GUI, and recommendation technique depend on the specific type of item under discussion.
K. G. Srinivasa, Anil Kumar Muppalla
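As one common way to build such a system, the sketch below trains a collaborative-filtering model with Spark MLlib's ALS in Scala; the "userId,itemId,rating" input format and the rank/iterations/lambda values are assumptions, not the book's own case study code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object RecommenderSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RecommenderSketch"))

    // Assumed input: lines of the form "userId,itemId,rating".
    val ratings = sc.textFile(args(0)).map { line =>
      val Array(user, item, rating) = line.split(",")
      Rating(user.toInt, item.toInt, rating.toDouble)
    }

    val rank = 10        // number of latent factors (assumption)
    val iterations = 10
    val lambda = 0.01    // regularization parameter (assumption)
    val model = ALS.train(ratings, rank, iterations, lambda)

    // Recommend the top 5 items for one (hypothetical) user id.
    model.recommendProducts(1, 5).foreach(println)
    sc.stop()
  }
}
```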
Backmatter
Metadata
Title
Guide to High Performance Distributed Computing
Authors
K.G. Srinivasa
Anil Kumar Muppalla
Copyright Year
2015
Electronic ISBN
978-3-319-13497-0
Print ISBN
978-3-319-13496-3
DOI
https://doi.org/10.1007/978-3-319-13497-0
