About this book

Leverage machine and deep learning models to build applications on real-time data using PySpark. This book is perfect for those who want to learn to use PySpark to perform exploratory data analysis and solve an array of business challenges.
You'll start by reviewing PySpark fundamentals, such as Spark's core architecture, and see how to use PySpark for big data processing tasks such as data ingestion, cleaning, and transformation. This is followed by building workflows for analyzing streaming data using PySpark and a comparison of various streaming platforms.
You'll then see how to schedule different Spark jobs using Airflow with PySpark and examine tuning machine and deep learning models for real-time predictions. The book concludes with a discussion of graph frames and performing network analysis using graph algorithms in PySpark. All the code presented in the book is available in Python scripts on GitHub.
What You'll Learn

Develop pipelines for streaming data processing using PySpark
Build machine learning and deep learning models using PySpark's latest offerings
Perform graph analytics using PySpark
Create sequence embeddings from text data
Who This Book Is For

Data scientists and machine learning and deep learning engineers who want to learn and use PySpark for real-time analysis of streaming data.

Table of Contents

Frontmatter

Chapter 1. Introduction to Spark

Abstract
As this book is about Spark, it makes perfect sense to start the first chapter by looking into some of Spark's history and its different components. This introductory chapter is divided into three sections. In the first, I go over the evolution of data and how it has grown to its current size, touching on three key aspects of data. In the second section, I delve into the internals of Spark and go over the details of its different components, including its architecture and modus operandi. The third and final section of this chapter focuses on how to use Spark in a cloud environment.
Pramod Singh
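
To make the entry point concrete, here is a minimal sketch of starting a Spark session in PySpark; the application name is an illustrative assumption, not taken from the chapter.

```python
from pyspark.sql import SparkSession

# SparkSession is the unified entry point to Spark's DataFrame and SQL APIs.
spark = (SparkSession.builder
             .appName("intro_to_spark")  # hypothetical app name
             .getOrCreate())

print(spark.version)  # confirm which Spark version the session is running
spark.stop()
```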

Chapter 2. Data Processing

Abstract
This chapter covers different steps to preprocess and handle data in PySpark. Preprocessing techniques can certainly vary from case to case, and many different methods can be used to massage the data into the desired form. The idea of this chapter is to expose some of the common techniques for dealing with big data in Spark. In this chapter, we are going to go over the different steps involved in preprocessing data, such as handling missing values, merging datasets, applying functions, aggregations, and sorting. One major part of data preprocessing is the transformation of numerical columns into categorical ones and vice versa, which we are going to look at in the coming chapters on machine learning. The dataset that we are going to make use of in this chapter is inspired by a primary research dataset and contains a few attributes from the original dataset, with additional columns containing fabricated data points.
Pramod Singh
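
As a flavor of these steps, the following is a minimal PySpark sketch of handling missing values, merging datasets, aggregating, and sorting; the column names and toy data are hypothetical, not from the book's dataset.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data_processing").getOrCreate()

# Hypothetical data standing in for the book's dataset.
users = spark.createDataFrame(
    [(1, "US", 25), (2, "IN", None), (3, "US", 31)],
    ["user_id", "country", "age"],
)
orders = spark.createDataFrame(
    [(1, 120.0), (1, 35.5), (3, 60.0)],
    ["user_id", "amount"],
)

# Handle missing values: replace null ages with a default.
users_clean = users.fillna({"age": 0})

# Merge datasets: join users with their orders.
joined = users_clean.join(orders, on="user_id", how="left")

# Aggregate and sort: total spend per country, highest first.
summary = (joined.groupBy("country")
                 .agg(F.sum("amount").alias("total_spend"))
                 .orderBy(F.desc("total_spend")))
summary.show()
```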

Chapter 3. Spark Structured Streaming

Abstract
This chapter discusses how to use Spark's streaming API to process real-time data. The first section focuses on the main differences between streaming and batch data, as well as their respective applications. The second section provides details on the Structured Streaming API and its various improvements over the earlier RDD-based Spark Streaming API. The final section includes the code for applying Structured Streaming to incoming data and discusses how to save the output results in memory. We'll also look at an alternative to Structured Streaming.
Pramod Singh
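
Along these lines, here is a minimal Structured Streaming sketch that writes results to an in-memory table; the built-in rate source and the query name rate_counts are illustrative assumptions, not from the chapter.

```python
import time

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("structured_streaming").getOrCreate()

# The built-in "rate" source generates timestamped rows continuously,
# which is handy for testing without a real stream.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# A running aggregation over 10-second event-time windows.
counts = stream.groupBy(F.window("timestamp", "10 seconds")).count()

query = (counts.writeStream
               .outputMode("complete")
               .format("memory")          # keep results in an in-memory table
               .queryName("rate_counts")  # table name for querying the sink
               .start())

time.sleep(15)  # let a few micro-batches run
spark.sql("SELECT * FROM rate_counts").show(truncate=False)
query.stop()
```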

Chapter 4. Airflow

Abstract
This chapter focuses on introducing Airflow and how it can be used to handle complex data workflows. Airflow was developed in-house by Airbnb engineers to manage internal workflows in an efficient manner. Airflow later became part of Apache in 2016 and was made available to users as open source software. Basically, Airflow is a framework for executing, scheduling, distributing, and monitoring various jobs in which there can be multiple tasks that are either interdependent or independent of one another. Every job that is run using Airflow must be defined via a directed acyclic graph (DAG) definition file, which contains the collection of tasks you want to run, grouped by their relationships and dependencies.
Pramod Singh
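
As an illustration of a DAG definition file, here is a minimal sketch assuming the Airflow 2.x API; the dag_id, schedule, and task commands are hypothetical examples, not from the book.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A DAG definition file: the collection of tasks to run, grouped by
# their relationships and dependencies.
with DAG(
    dag_id="example_pipeline",      # hypothetical job name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    # Dependencies define the directed acyclic graph of tasks.
    extract >> transform >> load
```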

Chapter 5. MLlib: Machine Learning Library

Abstract
Depending on your requirements, there are multiple ways in which you can build machine learning models, using preexisting libraries such as Python's scikit-learn, R, and TensorFlow. However, what makes Spark's Machine Learning library (MLlib) really useful is its ability to train models at scale and provide distributed training. This allows users to quickly build models on huge datasets, in addition to preprocessing and preparing workflows within the Spark framework itself.
Pramod Singh
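
A minimal sketch of an MLlib training workflow built as a Pipeline follows; the feature columns and toy data are hypothetical, not from the book.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib_pipeline").getOrCreate()

# Hypothetical labeled data.
df = spark.createDataFrame(
    [(1.0, 3.0, 0), (2.0, 1.0, 1), (0.5, 4.0, 0), (3.0, 0.5, 1)],
    ["f1", "f2", "label"],
)

# Assemble raw columns into the single vector column MLlib expects.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

# The Pipeline chains preprocessing and model training into one workflow,
# which Spark can execute in a distributed fashion.
model = Pipeline(stages=[assembler, lr]).fit(df)
model.transform(df).select("label", "prediction").show()
```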

Chapter 6. Supervised Machine Learning

Abstract
Machine learning can be broadly divided into four categories: supervised and unsupervised machine learning and, to a lesser extent, semi-supervised and reinforcement learning. Because supervised machine learning drives many business applications and significantly affects our day-to-day lives, it is considered one of the most important categories.
Pramod Singh
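
As a concrete instance of the supervised setup, here is a minimal PySpark sketch that learns a label from input features and evaluates on held-out data; the synthetic data is an assumption for illustration only.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("supervised_ml").getOrCreate()

# Synthetic labeled data: the label depends on the two input features,
# which is the relationship a supervised model tries to learn.
df = (spark.range(0, 500)
          .withColumn("x1", F.rand(seed=1))
          .withColumn("x2", F.rand(seed=2))
          .withColumn("label",
                      (F.col("x1") + F.col("x2") > 1.0).cast("double")))

data = VectorAssembler(
    inputCols=["x1", "x2"], outputCol="features"
).transform(df)

# Hold out a test set the model never sees during training.
train, test = data.randomSplit([0.75, 0.25], seed=42)

model = RandomForestClassifier(labelCol="label").fit(train)
accuracy = MulticlassClassificationEvaluator(
    labelCol="label", metricName="accuracy"
).evaluate(model.transform(test))
print(f"Held-out accuracy: {accuracy:.3f}")
```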

Chapter 7. Unsupervised Machine Learning

Abstract
As the name suggests, unsupervised machine learning does not involve finding relationships between input and output; in fact, there is no output that we try to predict in unsupervised learning. It is mainly used to group together features that seem similar to one another in some sense, based on the distance between them or some other similarity metric. In this chapter, I touch on some unsupervised machine learning techniques and build a machine learning model, using PySpark, to categorize users into groups and, later, to visualize those groups as well.
Pramod Singh
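
In the spirit of the chapter, here is a minimal sketch of grouping similar users with k-means in PySpark; the feature columns, toy data, and the choice of k = 2 are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("unsupervised_ml").getOrCreate()

# No output label here -- only input features describing each user.
users = spark.createDataFrame(
    [(1, 2.0, 50.0), (2, 2.5, 48.0), (3, 9.0, 5.0), (4, 8.5, 6.0)],
    ["user_id", "sessions_per_day", "avg_minutes"],
)
data = VectorAssembler(
    inputCols=["sessions_per_day", "avg_minutes"], outputCol="features"
).transform(users)

# k-means groups rows by distance in feature space; k=2 is assumed.
model = KMeans(k=2, seed=1).fit(data)
model.transform(data).select("user_id", "prediction").show()
```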

Chapter 8. Deep Learning Using PySpark

Abstract
Deep learning has been in the limelight for quite a few years and is making leaps and bounds in terms of solving various business challenges. From image recognition and language translation to self-driving cars, deep learning has become an important component in the larger scheme of things. There is no denying that many companies today are betting heavily on deep learning, as a majority of their applications run using deep learning in the back end. For example, Google's Gmail, YouTube, Search, Maps, and Assistant all use deep learning in some form or other. The reason is deep learning's incredible ability to provide far better results, compared to some other machine learning algorithms.
Pramod Singh

Backmatter
