
About this book

Use this guide to one of SQL Server 2019’s most impactful features: Big Data Clusters. You will learn about data virtualization and data lakes for this complete artificial intelligence (AI) and machine learning (ML) platform within the SQL Server database engine. You will learn how to use Big Data Clusters to combine large volumes of streaming data for analysis along with data stored in a traditional database. For example, you can stream large volumes of data from Apache Spark in real time while executing Transact-SQL queries to bring in relevant additional data from your corporate SQL Server database.
Filled with clear examples and use cases, this book provides everything necessary to get started working with Big Data Clusters in SQL Server 2019. You will learn about the architectural foundations, which are made up of Kubernetes, Spark, HDFS, and SQL Server on Linux. You are then shown how to configure and deploy Big Data Clusters in on-premises environments or in the cloud. Next, you are taught about querying. You will learn to write queries in Transact-SQL, taking advantage of skills you have honed for years, and with those queries you will be able to examine and analyze data from a wide variety of sources such as Apache Spark.
Through the theoretical foundation provided in this book and easy-to-follow example scripts and notebooks, you will be ready to use and unveil the full potential of SQL Server 2019: combining different types of data spread across widely disparate sources into a single view that is useful for business intelligence and machine learning analysis.

What You Will Learn

Install, manage, and troubleshoot Big Data Clusters in cloud or on-premises environments
Analyze large volumes of data directly from SQL Server and/or Apache Spark
Manage data stored in HDFS from SQL Server as if it were relational data
Implement advanced analytics solutions through machine learning and AI
Expose different data sources as a single logical source using data virtualization

Who This Book Is For

Data engineers, data scientists, data architects, and database administrators who want to employ data virtualization and big data analytics in their environments

Table of Contents

Frontmatter

Chapter 1. What Are Big Data Clusters?

Abstract
SQL Server 2019 Big Data Clusters, or just Big Data Clusters, are a new feature set within SQL Server 2019 with a broad range of functionality around data virtualization, data mart scale out, and artificial intelligence (AI).
Benjamin Weissman, Enrico van de Laar

Chapter 2. Big Data Cluster Architecture

Abstract
SQL Server Big Data Clusters are made up of a variety of technologies all working together to create a centralized, distributed data environment. In this chapter, we are going to look at the various technologies that make up Big Data Clusters through two different views.
Benjamin Weissman, Enrico van de Laar

Chapter 3. Deployment of Big Data Clusters

Abstract
Now it is time to install your very own SQL Server 2019 Big Data Cluster! We will be handling three different scenarios in detail and we will be using a fresh machine for each of those scenarios:
Benjamin Weissman, Enrico van de Laar

Chapter 4. Loading Data into Big Data Clusters

Abstract
With our first SQL Server Big Data Cluster in place, we should have a look at how we can use it. Therefore, we will start by adding some data to it.
Benjamin Weissman, Enrico van de Laar

Chapter 5. Querying Big Data Clusters Through T-SQL

Abstract
Now that we have some data to play with, let’s look at how we can process and query that data through the multiple options provided through Azure Data Studio.
Benjamin Weissman, Enrico van de Laar

Chapter 6. Working with Spark in Big Data Clusters

Abstract
So far, we have been querying data inside our SQL Server Big Data Cluster using external tables and T-SQL code. We do, however, have another method available to query data that is stored inside the HDFS filesystem of your Big Data Cluster. As you have read in Chapter 2, Big Data Clusters also have Spark included in the architecture, meaning we can leverage the power of Spark to query data stored inside our Big Data Cluster.
Benjamin Weissman, Enrico van de Laar

Chapter 7. Machine Learning on Big Data Clusters

Abstract
In the previous chapters, we spent significant time on how we can query data stored inside SQL Server instances or on HDFS through Spark. One advantage of having access to data stored in different formats is that it allows you to perform analysis of the data at a large, and distributed, scale. One of the more powerful options we can utilize inside Big Data Clusters is the ability to implement machine learning solutions on our data. Because Big Data Clusters allow us to store massive amounts of data in all kinds of formats and sizes, the ability to train, and utilize, machine learning models across all of that data becomes far easier.
Benjamin Weissman, Enrico van de Laar

Chapter 8. Create and Consume Big Data Cluster Apps

Abstract
One of the capabilities of SQL Server Big Data Clusters is the ability to build and run custom applications on its surface. This is actually a very powerful feature, since it allows you to script and run a wide variety of solutions on top of your Big Data Cluster. For instance, you can create an application, or app as we will call it in the remainder of this chapter, to perform various maintenance tasks on top of your data like a database backup. Another example is the ability to create an entry point for your machine learning processes through a REST API, a use case which we will explore later in this chapter.
Benjamin Weissman, Enrico van de Laar

Chapter 9. Maintenance of Big Data Clusters

Abstract
Last but not least, we want to look at how you can check the health of your Big Data Cluster, how an existing Big Data Cluster can be upgraded to a newer version, and how you can remove a Big Data Cluster instance, if it’s no longer needed.
Benjamin Weissman, Enrico van de Laar

Backmatter
