
About this book

Build a modern data warehouse on Microsoft's Azure Platform that is flexible, adaptable, and fast—fast to snap together, fast to reconfigure, and fast at delivering results to drive good decision making in your business.

Gone are the days when data warehousing projects were lumbering dinosaur-style projects that took forever, drained budgets, and produced business intelligence (BI) just in time to tell you what to do 10 years ago. This book will show you how to assemble a data warehouse solution like a jigsaw puzzle by connecting specific Azure technologies that address your own needs and bring value to your business. You will see how to implement a range of architectural patterns using batches, events, and streams for both data lake technology and SQL databases. You will discover how to manage metadata and automation to accelerate the development of your warehouse while establishing resilience at every level. And you will know how to feed downstream analytic solutions such as Power BI and Azure Analysis Services to empower data-driven decision making that drives your business forward toward a pattern of success.
This book teaches you how to employ the Azure platform in a strategy to dramatically improve implementation speed and flexibility of data warehousing systems. You will know how to make correct decisions in design, architecture, and infrastructure such as choosing which type of SQL engine (from at least three options) best meets the needs of your organization. You also will learn about ETL/ELT structure and the vast number of accelerators and patterns that can be used to aid implementation and ensure resilience. Data warehouse developers and architects will find this book a tremendous resource for moving their skills into the future through cloud-based implementations.

What You Will Learn

Choose the appropriate Azure SQL engine for implementing a given data warehouse
Develop smart, reusable ETL/ELT processes that are resilient and easily maintained
Automate mundane development tasks through tools such as PowerShell
Ensure consistency of data by creating and enforcing data contracts
Explore streaming and event-driven architectures for data ingestion
Create advanced staging layers using Azure Data Lake Gen 2 to feed your data warehouse

Who This Book Is For
Data warehouse and ETL/ELT developers who wish to implement a data warehouse project in the Azure cloud, developers currently working in on-premises environments who want to move to the cloud, and developers with Azure experience looking to tighten up their implementations and consolidate their knowledge

Table of Contents

Frontmatter

Chapter 1. The Rise of the Modern Data Warehouse

Abstract
A data warehouse is a common and well-understood technology asset that underpins many decision support systems. Whether the warehouse was initially designed to act as a hub for data integration or a base for analytical consistency, many organizations make use of the concepts and technologies that underpin data warehousing.
Matt How

Chapter 2. The SQL Engine

Abstract
The focus of this chapter is to break open the mysteries of each SQL storage engine and understand why a particular flavor of Azure SQL technology suits one scenario over another. We will analyze the underlying architecture of each service so that development choices can be well informed and well reasoned. Once we understand how each implementation of the SQL engine in Azure processes and stores data, we can look at the direction Microsoft is taking that technology and forecast whether the same choice would be made in the future. The knowledge gained in this chapter should provide you with the capability to understand your source data and therefore to choose which SQL engine should be used to store and process that data.
Matt How

Chapter 3. The Integration Engine

Abstract
The concept of data integration often sparks a lengthy and convoluted debate as to the best approach and technology for the given sources and destinations. In addition to out-of-the-box products such as SQL Server Integration Services (SSIS), there is also a wealth of open source tools to consider, not forgetting the third-party connectors and bespoke, source system–specific integration tools that all help to muddy the waters.
Matt How

Chapter 4. The Ingestion Architecture

Abstract
Data does not stand still. As data warehouse developers, this is a known fact on which our careers are based. For data to have value, it has to be reliably moved to a place where that value can be realized and the method by which we move data should depend on the needs of our users and the frequency of the data, not on the physical or technological limits of the system. As this book examines a modern data warehouse, we need to research beyond the traditional defaults such as batch-based ingestion and simple lift and shift extract, transform, and load (ETL) patterns and explore how we offer more flexibility to the end users. This chapter outlines an approach for warehouse loading that promotes efficiency and resilience, moving on to describe three ingestion modes. By defining the risks and benefits of batch-based, event-based, and streaming modes, you will know how to implement each approach while also being aware of the additional complexities of each, ensuring a successful implementation.
Matt How
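The three ingestion modes named in the abstract above can be contrasted with a minimal, purely illustrative sketch. This is not code from the book; the enum values and the toy heuristic for choosing a mode from user latency needs and data frequency are assumptions made for illustration.

```python
from enum import Enum

class IngestionMode(Enum):
    BATCH = "batch"    # scheduled bulk loads at fixed intervals
    EVENT = "event"    # triggered per file or message arrival
    STREAM = "stream"  # continuous, low-latency feed

def choose_mode(latency_seconds: float, arrives_continuously: bool) -> IngestionMode:
    """Toy heuristic: pick an ingestion mode from how fresh the data
    must be for users and whether it arrives as a continuous flow."""
    if arrives_continuously and latency_seconds < 60:
        return IngestionMode.STREAM
    if latency_seconds < 3600:
        return IngestionMode.EVENT
    return IngestionMode.BATCH
```

The point of the sketch is the abstract's argument: the mode should follow from user needs and data frequency, not from the technological limits of the system.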

Chapter 5. The Role of the Data Lake

Abstract
As the data needs of a business change, the methods to store, organize, and audit data need to change as well. Big data is the buzzword of the day, and big data needs a scalable storage platform. Multimedia files such as images, videos, and audio files need to be co-located and reported against, and so a platform that can accommodate such diverse data types is required. A modern data platform may also need to ingest data at incredibly high speeds, and having a platform that can cope with streaming and scale accordingly is essential. There is such a variety of requirements for data storage in modern businesses that managing and maintaining a dedicated storage system for each would be impossible. What is needed is a simple option that implements a "less is more" approach to offer the scalability and diversity required. What is needed is a data lake.
Matt How

Chapter 6. The Role of the Data Contract

Abstract
In all data integration projects, there is always a concern about datasets changing their properties. This could be changing columns, changing data types, or even changing the degree of quality instilled in the data. The technical name for this is "schema evolution," sometimes known as schema drift, and whether that be new columns arriving or known columns dropping off, how these situations are handled can have a huge effect on the success of the project. At a basic level, you need to be able to detect and react to occasions when a dataset's schema has evolved, and with the vast number of file and database types available, this task is getting more complex. Not only do you need to detect changes in tabular data (CSV files, database extracts) but also in semi-structured datasets such as JSON and XML. Expanding on this basic concept, you need to be able to handle the schema drift so that you can continue to integrate the data without having to manage multiple extraction methods for the same type of data. This may be manual to begin with, but there are tools out there now that can automatically handle schema evolution. As you begin to write ingestion procedures, remember that maintaining these schemas through schema evolution needs to be simple. If you get to a point where you are ingesting over 20 different files or datasets, then you do not want to have to visit each script to update the schema. Instead, you need a centralized schema store so that you can easily make updates in a controlled way.
Matt How
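The detection step described in the abstract above can be sketched in a few lines. This is a minimal illustration, not code from the book: the JSON contract layout (a single "columns" list) and the function names are assumptions, and a real implementation would also compare data types and cover semi-structured formats.

```python
import csv
import json

def load_contract(path):
    """Read the expected schema (column names) from a centrally
    stored JSON contract file, e.g. {"columns": ["id", "name"]}."""
    with open(path) as f:
        return json.load(f)

def detect_drift(csv_path, contract):
    """Compare an incoming CSV header against the contract and
    report columns that have arrived or dropped off."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    expected = set(contract["columns"])
    actual = set(header)
    return {"added": sorted(actual - expected),
            "missing": sorted(expected - actual)}
```

Because every ingestion procedure reads the same contract file, a schema update is made once in the central store rather than in each of the 20-plus ingestion scripts.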

Chapter 7. Logging, Auditing, and Resilience

Abstract
Things will go wrong in your data processing pipeline. I wish there was a less blunt way to say it, but it is true. In the majority of cases, it may not even be the fault of the platform or the developers. It could be the source provider updating their software, or an intermittent loss of connectivity to an Azure service, or even a harmless comma manually entered into just the wrong place. Whatever the fault and however trivial it may be, each has the ability to disrupt your warehouse and ultimately cause a loss of service to your users.
Matt How
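The combination of logging and resilience that the abstract above describes is often implemented as a retry wrapper that records every failure before trying again. The sketch below is an illustrative assumption, not the book's code: names such as `resilient` and the retry/backoff parameters are invented for the example.

```python
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("pipeline")

def resilient(retries=3, delay=0.1):
    """Retry a flaky pipeline step, logging each failure so the
    run can be audited later; re-raise after the final attempt."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    log.warning("step %s failed (attempt %d/%d): %s",
                                fn.__name__, attempt, retries, exc)
                    if attempt == retries:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator
```

A transient fault such as a dropped connection is then absorbed and audited rather than failing the whole warehouse load, while a persistent fault still surfaces as an error after the last attempt.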

Chapter 8. Using Scripting and Automation

Abstract
A common attribute of many developers is the desire to do things quickly, consistently, and once only. To address this desire, scripting and automation are often used, as they provide a highly consistent method to complete regularly occurring tasks. This chapter aims to walk through three of my most used scripts in the hope that they can also be useful to readers of this book. All the scripts featured in this chapter are written in PowerShell and operate or automate key pieces of a modern data warehouse: the SQL engine, Data Factory, and the data lake.
Matt How

Chapter 9. Beyond the Modern Data Warehouse

Abstract
In days gone by, a data warehouse stood as a slow-moving, often large, unwieldy part of a wider decision support system. While tools and technologies that feed to or read from the data warehouse may develop, the complexity of such an artifact and the investment in its development mean that the warehouse would rarely benefit from such upgrades. Throughout this book, I have explained and demonstrated the highlights of building a modern data warehouse in Azure – one which can be developed rapidly and be highly flexible to source system requirements, one which can move and develop with the times and not cause sleepless nights worrying over the SQL version going out of support, one which can ingest in batch-, stream-, or event-based modes offering ultimate speed and time to insight. The focus of this final chapter is to look at what sits beyond the modern data warehouse. There is a wealth of BI products in the market that provide a range of capabilities and visualizations to the end user, and it can be very difficult to choose between them without a thorough review. This chapter is not a thorough review of BI products but instead gives examples of downstream options for warehouse data. Initially this chapter will look at Power BI, as that is the de facto visualization tool for any data, but it will also examine some other Microsoft products for data as it leaves the data warehouse, such as Azure Analysis Services and Azure Cosmos DB.
Matt How

Backmatter
