
2017 | Book

Docker for Data Science

Building Scalable and Extensible Data Infrastructure Around the Jupyter Notebook Server


About this book

Learn Docker "infrastructure as code" technology to define a system for performing standard but non-trivial data tasks on medium- to large-scale data sets, using Jupyter as the master controller.
It is not uncommon for a real-world data set to resist easy management. The set may not fit into available memory, or may require prohibitively long processing. These are significant challenges even for skilled software engineers, and they can render the standard Jupyter system unusable.

As a solution to this problem, Docker for Data Science proposes using Docker. You will learn how to use existing pre-compiled public images created for the major open-source technologies—Python, Jupyter, Postgres—as well as how to use the Dockerfile to extend these images to suit your specific purposes. The Docker Compose technology is examined, and you will learn how it can be used to build a linked system with Python churning data behind the scenes and Jupyter managing these background tasks. Best practices in using existing images are explored, as is developing your own images to deploy state-of-the-art machine learning and optimization algorithms.
What You'll Learn

Master interactive development using the Jupyter platform
Run and build Docker containers from scratch and from publicly available open-source images
Write infrastructure as code using the docker-compose tool and its docker-compose.yml file type
Deploy a multi-service data science application across a cloud-based system

Who This Book Is For
Data scientists, machine learning engineers, artificial intelligence researchers, Kagglers, and software developers

Table of Contents

Frontmatter
Chapter 1. Introduction
Abstract
The typical data scientist consistently has a series of extremely complicated problems on their mind beyond considerations stemming from their system infrastructure. Still, it is inevitable that infrastructure issues will present themselves. To oversimplify, we might draw a distinction between the “modeling problem” and the “engineering problem.” The data scientist is uniquely qualified to solve the former, but can often come up short in solving the latter.
Joshua Cook
Chapter 2. Docker
Abstract
Docker is a way to isolate a process from the system on which it is running. It allows us to isolate the code written to define an application, and the resources required to run that application, from the hardware on which it runs. We add a layer of complexity to our software, but in doing so gain the advantage of ensuring that our local development environment will be identical to any possible environment into which we would deploy the application. If a system can run Docker, it can run our process. With the addition of a thin layer of abstraction we become hardware-independent. On its face, this would seem to be an impossible task. As of 2014, there were 285 actively maintained Linux distributions and multiple major versions of both OS X and Windows. How could we possibly write a system to allow for all possible development, testing, and production environments?
Joshua Cook
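The hardware independence described in this abstract is visible in the most basic Docker workflow: the same commands produce the same behavior on any host running a Docker engine. A minimal sketch (not a listing from the book; `alpine` is a standard public Docker Hub image, and the commands require a running Docker daemon):

```shell
# Pull a minimal public image and run a process inside it.
# The container behaves identically on any host running Docker.
docker pull alpine
docker run --rm alpine echo "same result on every host"

# Inspect the engine that makes this guarantee possible.
docker version
```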
Chapter 3. Interactive Programming
Abstract
Interactive computing is a dialog between people and machines.
Joshua Cook
Chapter 4. The Docker Engine
Abstract
If I have not emphasized this enough, the magic happens because we can count on the Docker engine to work the same way no matter our underlying hardware (or virtual hardware) and operating system. We build it using the Docker engine, we test it using the Docker engine, and we deploy it using the Docker engine.
Joshua Cook
Chapter 5. The Dockerfile
Abstract
Every Docker image is defined as a stack of layers, each defining fundamental, stateless changes to the image. The first layer might be the operating system (a Debian or Ubuntu Docker image), the next the installation of dependencies necessary for your application to run, and so on up to the source code of your application.
Joshua Cook
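The layer stack described in this abstract maps one-to-one onto the instructions in a Dockerfile. A hedged sketch, not drawn from the book's own listings (the file names `requirements.txt` and `main.py` are illustrative):

```dockerfile
# Each instruction below produces one image layer, from OS up to source code.
FROM ubuntu:16.04
# Layer: system dependencies needed to run the application.
RUN apt-get update && apt-get install -y python3 python3-pip
# Layer: the Python dependency manifest, copied in before the source
# so that dependency installation is cached across source-code changes.
COPY requirements.txt /app/
RUN pip3 install -r /app/requirements.txt
# Top layer: the application source code itself.
COPY . /app
CMD ["python3", "/app/main.py"]
```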
Chapter 6. Docker Hub
Abstract
Equipped with tools for developing our own images, it quickly becomes important to be able to save and share the images we have written beyond our system. Docker Registries allow us to do just this. For your purposes, the public Docker Registry, Docker Hub, will be more than sufficient, though it is worth noting that other registries exist and that it is possible to create and host your own registry.
Joshua Cook
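Saving and sharing an image through Docker Hub, as this abstract describes, comes down to tagging the image under your registry namespace and pushing it. A sketch under stated assumptions (the image name `my-notebook-image` and the `<username>` placeholder are hypothetical; the commands require Docker and a Docker Hub account):

```shell
# Tag a locally built image under your Docker Hub username,
# authenticate, and push it to the public registry.
docker tag my-notebook-image <username>/my-notebook-image:latest
docker login
docker push <username>/my-notebook-image:latest

# On any other machine with Docker, retrieve the same image.
docker pull <username>/my-notebook-image:latest
```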
Chapter 7. The Opinionated Jupyter Stacks
Abstract
The Jupyter Notebook is based on a set of open standards for interactive computing.
Joshua Cook
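The opinionated Jupyter stacks referred to in this chapter title are the community-maintained images under the `jupyter/` namespace on Docker Hub. A minimal sketch of launching one (the port and the `/home/jovyan/work` path follow the Jupyter Docker Stacks convention; a Docker daemon is assumed):

```shell
# Launch the SciPy Jupyter stack in the background, publishing the
# notebook server's port and mounting the current directory as the
# notebook working directory.
docker run -d -p 8888:8888 \
    -v "$PWD":/home/jovyan/work \
    jupyter/scipy-notebook
```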
Chapter 8. The Data Stores
Abstract
I propose that using Docker, it is possible to streamline the process to such an extent that using a data store for even the smallest of datasets becomes practical. I'll show you a series of best practices for designing and deploying data stores, a set of practices that will be sufficient for working with all but the largest of data sets. Conforming to Docker best practice, you will work with Docker Hub official images throughout this chapter.
Joshua Cook
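Running a data store from an official image, as this abstract proposes, can be as short as two commands. A hedged sketch using the official `postgres` image (the container name, volume name, and password are illustrative; a Docker daemon is assumed):

```shell
# Create a named volume so the database survives container restarts,
# then run the official Postgres image as a containerized data store.
docker volume create pgdata
docker run -d --name my-postgres \
    -e POSTGRES_PASSWORD=secret \
    -v pgdata:/var/lib/postgresql/data \
    -p 5432:5432 \
    postgres
```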
Chapter 9. Docker Compose
Abstract
Thus far, I have focused the discussion on single containers or individually managed pairs of containers running on the same system. In this chapter, you'll extend your ability to develop applications composed of multiple containers using the Docker Compose tool.
Joshua Cook
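The multi-container systems this chapter builds are declared in a single docker-compose.yml file. A sketch of a linked two-service system of the kind the book's blurb describes, with Jupyter in front and a data store behind it (the service names and password are illustrative, not taken from the book):

```yaml
# docker-compose.yml: Jupyter notebook server linked to a Postgres
# data store. `docker-compose up` starts both services together.
version: '3'
services:
  jupyter:
    image: jupyter/scipy-notebook
    ports:
      - "8888:8888"
    depends_on:
      - db
  db:
    image: postgres
    environment:
      POSTGRES_PASSWORD: secret
```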
Chapter 10. Interactive Software Development
Abstract
The most famous of these might be the Rails framework for the Ruby language. Rails is written from the ground up around its adopted paradigm, the Model-View-Controller design pattern, a pattern heavily favored in the implementation of user-facing software.
Joshua Cook
Backmatter
Metadata
Title
Docker for Data Science
Author
Joshua Cook
Copyright year
2017
Publisher
Apress
Electronic ISBN
978-1-4842-3012-1
Print ISBN
978-1-4842-3011-4
DOI
https://doi.org/10.1007/978-1-4842-3012-1
