
2019 | Book

Learn PySpark

Build Python-based Machine Learning and Deep Learning Models


About this book

Leverage machine and deep learning models to build applications on real-time data using PySpark. This book is ideal for those who want to use PySpark to perform exploratory data analysis and solve an array of business challenges.
You'll start by reviewing PySpark fundamentals, such as Spark's core architecture, and see how to use PySpark for big data processing tasks such as data ingestion, cleaning, and transformation. This is followed by building workflows for analyzing streaming data using PySpark and a comparison of various streaming platforms.
You'll then see how to schedule different Spark jobs using Airflow with PySpark and examine tuning machine and deep learning models for real-time predictions. The book concludes with a discussion of GraphFrames and performing network analysis using graph algorithms in PySpark. All the code presented in the book is available as Python scripts on GitHub.
What You'll Learn

Develop pipelines for streaming data processing using PySpark
Build machine learning and deep learning models using PySpark's latest offerings
Perform graph analytics with PySpark
Create sequence embeddings from text data
Who This Book Is For

Data scientists and machine learning and deep learning engineers who want to learn and use PySpark for real-time analysis of streaming data.

Table of Contents

Frontmatter
Chapter 1. Introduction to Spark
Abstract
As this book is about Spark, it makes perfect sense to start the first chapter by looking into some of Spark's history and its different components. This introductory chapter is divided into three sections. In the first, I go over the evolution of data and how it has grown to its current scale, touching on three key aspects of data. In the second section, I delve into the internals of Spark and go over the details of its different components, including its architecture and modus operandi. The third and final section of this chapter focuses on how to use Spark in a cloud environment.
Pramod Singh
Chapter 2. Data Processing
Abstract
This chapter covers the different steps needed to preprocess and handle data in PySpark. Preprocessing techniques can certainly vary from case to case, and many different methods can be used to massage the data into the desired form. The idea of this chapter is to expose some of the common techniques for dealing with big data in Spark. We are going to go over the different steps involved in preprocessing data, such as handling missing values, merging datasets, applying functions, aggregations, and sorting. One major part of data preprocessing is the transformation of numerical columns into categorical ones and vice versa, which we will look at in the next few chapters, in the context of machine learning. The dataset we are going to make use of in this chapter is inspired by a primary research dataset and contains a few attributes from the original dataset, with additional columns containing fabricated data points.
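To make these steps concrete, here is a minimal sketch of the kind of preprocessing the chapter walks through. The file names and column names (customers.csv, orders.csv, customer_id, age, country, amount) are hypothetical placeholders, not the book's dataset:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data_processing").getOrCreate()

# Hypothetical input files; the book uses its own fabricated dataset
customers = spark.read.csv("customers.csv", header=True, inferSchema=True)
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

# Handle missing values: drop rows lacking the key, fill a default elsewhere
customers = customers.dropna(subset=["customer_id"]).fillna({"age": 0})

# Merge the two datasets on a shared key
df = customers.join(orders, on="customer_id", how="left")

# Apply a function to derive a new column
df = df.withColumn("age_group",
                   F.when(F.col("age") < 30, "young").otherwise("other"))

# Aggregate and sort
(df.groupBy("country")
   .agg(F.count("*").alias("num_orders"), F.avg("amount").alias("avg_amount"))
   .orderBy(F.desc("num_orders"))
   .show())
```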
Pramod Singh
Chapter 3. Spark Structured Streaming
Abstract
This chapter discusses how to use Spark’s streaming API to process real-time data. The first part focuses on the main difference between streaming and batch data, in addition to their specific applications. The second section provides details on the Structured Streaming API and its various improvements over previous RDD-based Spark streaming APIs. The final section includes the code to use for Structured Streaming on incoming data and discusses how to save the output results in memory. We’ll also look at an alternative to Structured Streaming.
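As a taste of that code, below is a minimal Structured Streaming sketch that treats files arriving in a directory as an unbounded input table and keeps a running aggregate in an in-memory sink; the schema, directory path, and query name are illustrative assumptions, not the book's exact example:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("structured_streaming").getOrCreate()

# Streaming sources require an explicit schema
schema = StructType([
    StructField("user", StringType(), True),
    StructField("amount", DoubleType(), True),
])

# Each new file landing in the directory becomes new rows in the stream
stream_df = spark.readStream.schema(schema).csv("/tmp/incoming/")

totals = stream_df.groupBy("user").sum("amount")

# Write the running aggregate to an in-memory table queryable via SQL
query = (totals.writeStream
               .outputMode("complete")
               .format("memory")
               .queryName("totals")
               .start())

# Re-running this query shows updated totals as new files arrive
spark.sql("SELECT * FROM totals").show()
```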
Pramod Singh
Chapter 4. Airflow
Abstract
This chapter introduces Airflow and how it can be used to handle complex data workflows. Airflow was developed in-house by Airbnb engineers, to manage their internal workflows in an efficient manner. Airflow later went on to become part of Apache in 2016 and was made available to users as open source software. At its core, Airflow is a framework for executing, scheduling, distributing, and monitoring jobs composed of multiple tasks that are either interdependent or independent of one another. Every job run using Airflow must be defined via a directed acyclic graph (DAG) definition file, which contains the collection of tasks you want to run, grouped by their relationships and dependencies.
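A minimal sketch of such a DAG definition file is shown below, assuming Airflow 2.x import paths. The pipeline name, task ids, schedule, and scripts are hypothetical, and each PySpark job is launched here with a plain spark-submit via BashOperator (Airflow also offers a dedicated Spark operator in a provider package):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x import path

with DAG(
    dag_id="daily_pyspark_pipeline",   # hypothetical pipeline name
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
) as dag:
    ingest = BashOperator(task_id="ingest",
                          bash_command="spark-submit ingest.py")
    transform = BashOperator(task_id="transform",
                             bash_command="spark-submit transform.py")
    report = BashOperator(task_id="report",
                          bash_command="spark-submit report.py")

    # The >> operator declares dependencies, forming the DAG:
    # ingest runs first, then transform, then report
    ingest >> transform >> report
```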
Pramod Singh
Chapter 5. MLlib: Machine Learning Library
Abstract
Depending on your requirements, there are multiple ways to build machine learning models, using preexisting libraries and tools such as Python's scikit-learn, R, and TensorFlow. However, what makes Spark's Machine Learning library (MLlib) really useful is its ability to train models at scale and provide distributed training. This allows users to quickly build models on huge datasets, in addition to preprocessing and preparing the workflows within the Spark framework itself.
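For a flavor of what that looks like, here is a minimal sketch of a distributed MLlib pipeline; the input path and column names (age, income, visits, label) are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib_demo").getOrCreate()
df = spark.read.parquet("/data/training/")  # data stays distributed across the cluster

# MLlib expects the input features packed into a single vector column
assembler = VectorAssembler(inputCols=["age", "income", "visits"],
                            outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Preprocessing and training run as one distributed pipeline
model = Pipeline(stages=[assembler, lr]).fit(df)
predictions = model.transform(df)
predictions.select("label", "prediction", "probability").show(5)
```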
Pramod Singh
Chapter 6. Supervised Machine Learning
Abstract
Machine learning can be broadly divided into four categories: supervised and unsupervised machine learning and, to a lesser extent, semi-supervised machine learning and reinforcement learning. Because supervised machine learning drives a lot of business applications and significantly affects our day-to-day lives, it is considered one of the most important categories.
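As a minimal illustration of the supervised setting, the sketch below fits a classifier on labeled data and evaluates it on a held-out split; the path, feature columns, and the choice of a random forest are illustrative assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("supervised_demo").getOrCreate()
df = spark.read.parquet("/data/labeled/")  # hypothetical labeled dataset

# Pack the hypothetical feature columns into a single vector column
df = VectorAssembler(inputCols=["f1", "f2", "f3"],
                     outputCol="features").transform(df)

# Hold out 20% of the data to measure generalization
train, test = df.randomSplit([0.8, 0.2], seed=42)

model = RandomForestClassifier(labelCol="label").fit(train)

# Area under the ROC curve on the held-out split
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(model.transform(test))
print(f"test AUC = {auc:.3f}")
```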
Pramod Singh
Chapter 7. Unsupervised Machine Learning
Abstract
As the name suggests, unsupervised machine learning does not involve finding relationships between input and output; in fact, there is no output that we try to predict in unsupervised learning. It is mainly used to group together features that appear similar to one another in some sense, based on the distance between them or some other similarity metric. In this chapter, I touch on some unsupervised machine learning techniques and use PySpark to build a machine learning model that categorizes users into groups and, later, visualize those groups as well.
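Here is a minimal sketch of that idea, using k-means clustering from MLlib; the user features, the input path, and the choice of k = 4 are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("clustering_demo").getOrCreate()
df = spark.read.parquet("/data/users/")  # hypothetical user dataset

# Pack the hypothetical user attributes into a feature vector
df = VectorAssembler(inputCols=["sessions", "avg_spend"],
                     outputCol="features").transform(df)

# No labels: k-means groups users purely by distance between feature vectors
model = KMeans(k=4, seed=1).fit(df)
clustered = model.transform(df)  # adds a "prediction" column with the cluster id
clustered.groupBy("prediction").count().show()
```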
Pramod Singh
Chapter 8. Deep Learning Using PySpark
Abstract
Deep learning has been in the limelight for quite a few years and is advancing by leaps and bounds in terms of solving various business challenges. From image recognition and language translation to self-driving cars, deep learning has become an important component in the larger scheme of things. There is no denying that many companies today are betting heavily on deep learning, as a majority of their applications run on it in the back end. For example, Google's Gmail, YouTube, Search, Maps, and Assistant all use deep learning in some form or other. The reason is deep learning's incredible ability to provide far better results than many other machine learning algorithms.
Pramod Singh
Backmatter
Metadata
Title
Learn PySpark
Author
Pramod Singh
Copyright Year
2019
Publisher
Apress
Electronic ISBN
978-1-4842-4961-1
Print ISBN
978-1-4842-4960-4
DOI
https://doi.org/10.1007/978-1-4842-4961-1
