About This Book

Data engineers who need to hit the ground running will use this book to build skills in Azure Data Factory v2 (ADF). The tutorial-first approach to ADF taken in this book gets you working from the first chapter, explaining key ideas naturally as you encounter them. From creating your first data factory to building complex, metadata-driven nested pipelines, the book guides you through essential concepts in Microsoft’s cloud-based ETL/ELT platform. It introduces components indispensable for the movement and transformation of data in the cloud. Then it demonstrates the tools necessary to orchestrate, monitor, and manage those components.
The hands-on introduction to ADF found in this book suits data engineers embracing their first ETL/ELT toolset as well as seasoned veterans of Microsoft’s SQL Server Integration Services (SSIS). The example-driven approach leads you through ADF pipeline construction from the ground up, introducing important ideas and making learning natural and engaging. SSIS users will find concepts with familiar parallels, while ADF-first readers will quickly master those concepts as the book builds knowledge steadily across successive chapters. Summaries of key concepts at the end of each chapter provide a ready reference that you can return to again and again.

What You Will Learn
Create pipelines, activities, datasets, and linked services
Build reusable components using variables, parameters, and expressions
Move data into and around Azure services automatically
Transform data natively using ADF data flows and Power Query data wrangling
Master flow-of-control and triggers for tightly orchestrated pipeline execution
Publish and monitor pipelines easily and with confidence

Who This Book Is For
Data engineers and ETL developers taking their first steps in Azure Data Factory, SQL Server Integration Services users making the transition toward doing ETL in Microsoft’s Azure cloud, and SQL Server database administrators involved in data warehousing and ETL operations

Table of Contents

Frontmatter

Chapter 1. Creating an Azure Data Factory Instance

Abstract
A major responsibility of the data engineer is the development and management of extract, transform, and load (ETL) and other data integration workloads. Real-time integration workloads process data as it is generated – for example, a transaction being recorded at a point-of-sale terminal or a sensor measuring the temperature in a data center. In contrast, batch integration workloads run at intervals, usually processing data produced since the previous batch run.
Richard Swinbank
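
The batch pattern described in the Chapter 1 abstract, where each run processes only data produced since the previous run, can be sketched in a few lines. This is a minimal illustration, not code from the book: the record layout, the `select_batch` helper, and the timestamps are all hypothetical.

```python
from datetime import datetime


def select_batch(records, last_run, now):
    """Return the records created after last_run and no later than now.

    Hypothetical helper: each record is assumed to be a dict with a
    'created' timestamp.
    """
    return [r for r in records if last_run < r["created"] <= now]


# Illustrative data only: three records with made-up timestamps.
records = [
    {"id": 1, "created": datetime(2024, 1, 1, 9, 0)},
    {"id": 2, "created": datetime(2024, 1, 1, 13, 0)},
    {"id": 3, "created": datetime(2024, 1, 2, 8, 0)},
]

previous_run = datetime(2024, 1, 1, 12, 0)
current_run = datetime(2024, 1, 2, 12, 0)

# Record 1 predates the previous run, so only records 2 and 3 are picked up.
batch = select_batch(records, previous_run, current_run)
batch_ids = [r["id"] for r in batch]
```

The half-open interval (exclusive lower bound, inclusive upper bound) is one common choice: it prevents a record that falls exactly on a run boundary from being processed twice.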

Chapter 2. Your First Pipeline

Abstract
ETL workloads are implemented in Azure Data Factory in units called pipelines. Using the data factory instance you created in Chapter 1, in this chapter, you will create a pipeline using the Copy Data tool – a pipeline creation wizard that steps through creating the various components that make up a pipeline. Afterward, you’ll be able to examine the pipeline in detail to gain an understanding of how it is constructed.
Richard Swinbank

Chapter 3. The Copy Data Activity

Abstract
Data integration tasks can be divided into two groups: those of data movement and those of data transformation. In Chapter 2, you created an Azure Data Factory pipeline that copied data from one blob storage container to another – a simple data movement using the Copy data activity. The Copy data activity is the core tool in Azure Data Factory for moving data from one place to another, and this chapter explores its application in greater detail.
Richard Swinbank

Chapter 4. Expressions

Abstract
The pipelines you authored in Chapter 3 all have at least one thing in common: the values of all their properties are static – that is to say that they are determined at development time. In very many places, Azure Data Factory supports the use of dynamic property values – determined at runtime – through the use of expressions.
Richard Swinbank

Chapter 5. Parameters

Abstract
Chapter 4 introduced expressions as a way of setting property values in factory resources at runtime. The examples presented used expressions to determine values for a variety of properties, all of which were under the internal control of the pipeline. But sometimes it is convenient to be able to inject external values into factory resources at runtime, either to share data or to create generic resources which can be reused in multiple scenarios. Injection of runtime values is achieved by using parameters.
Richard Swinbank

Chapter 6. Controlling Flow

Abstract
In Chapter 4, you began building ADF pipelines containing more than one activity, controlling their order of execution using activity dependencies configured between them. Activity dependencies are among a range of tools available in ADF for controlling a pipeline’s flow of execution.
Richard Swinbank
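
As a rough sketch of what activity dependencies accomplish, the snippet below treats a pipeline as a graph in which each activity lists the activities it depends on, then derives a valid execution order. It illustrates the idea only and is not how ADF is implemented or configured; the activity names are hypothetical, and the ordering uses Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each activity maps to the activities it depends on.
depends_on = {
    "CopySales": [],
    "CopyCustomers": [],
    "BuildReport": ["CopySales", "CopyCustomers"],
    "NotifyTeam": ["BuildReport"],
}

# static_order() yields every activity after all of its dependencies.
order = list(TopologicalSorter(depends_on).static_order())
```

The two copy activities have no dependency on each other, so either may come first; `BuildReport` always follows both, and `NotifyTeam` always comes last.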

Chapter 7. Data Flows

Abstract
The Copy data activity is a powerful tool for moving data between data storage systems, but it has limited support for data transformation. Columns can be added to the activity’s source configuration, or removed by excluding them from the source to sink mapping, but the activity does not support manipulation of individual rows or allow data sources to be combined or separated.
Richard Swinbank

Chapter 8. Integration Runtimes

Abstract
ADF linked services represent connections to storage or compute resources – decoupled from one another in the way described in Chapter 2 – that are external to Azure Data Factory. The linked services you have been using in previous chapters represent connections to external storage, and access to external compute (such as HDInsight or Azure Databricks) is managed in the same way.
Richard Swinbank

Chapter 9. Power Query in ADF

Abstract
ADF data flows, covered in Chapter 7, model a data integration process as a stream of data rows that undergoes a succession of transformations to produce an output dataset. This approach to conceptualizing ETL operations is long established and may be familiar from other tools, including SQL Server Integration Services. While powerful, this view of a process can be inconvenient – when a new, unknown source dataset is being evaluated and understood, for example, or for users new to data engineering.
Richard Swinbank

Chapter 10. Publishing to ADF

Abstract
The relationship between the ADF UX, Git, and your development data factory, first introduced in Chapter 1, is shown in Figure 10-1. In subsequent chapters, you have been authoring factory resources in the ADF UX, then saving them to the Git repository linked to your development factory, and running them in Debug mode using the development factory’s compute (integration runtimes). Those interactions are shown in Figure 10-1 as dashed arrows.
Richard Swinbank

Chapter 11. Triggers

Abstract
In Chapter 10, you explored how to deploy Azure Data Factory resources into published factory environments. You tested running one or more published pipelines by executing them manually from the ADF UX – in this chapter, you will explore how pipelines can be executed automatically using triggers.
Richard Swinbank

Chapter 12. Monitoring

Abstract
The previous two chapters have been largely concerned with what happens to Azure Data Factory resources after you have finished developing them – how to get them into a production environment and how to run them automatically. This final chapter completes a trio of requirements for operating a production ADF instance: monitoring the behavior of deployed factory resources to ensure that individual resources and the factory as a whole continue to operate correctly.
Richard Swinbank

Backmatter
