2024 | Book

# Deep Generative Modeling

Author: Jakub M. Tomczak

Publisher: Springer International Publishing

This first comprehensive book on the models behind Generative AI has been thoroughly revised to cover all major classes of deep generative models: mixture models, probabilistic circuits, autoregressive models, flow-based models, latent variable models, GANs, hybrid models, score-based generative models, energy-based models, and large language models. In addition, Generative AI systems are discussed, demonstrating how deep generative models can be used for neural compression, among other applications.

Deep Generative Modeling is designed to appeal to curious students, engineers, and researchers with a modest mathematical background in undergraduate calculus, linear algebra, probability theory, and the basics of machine learning, deep learning, and programming in Python and PyTorch (or other deep learning libraries). It should find interest among students and researchers from a variety of backgrounds, including computer science, engineering, data science, physics, and bioinformatics who wish to get familiar with deep generative modeling.

In order to engage the reader, the book introduces fundamental concepts with specific examples and code snippets. The full code accompanying the book is available on the author's GitHub site: github.com/jmtomczak/intro_dgm

The ultimate aim of the book is to outline the most important techniques in deep generative modeling and, eventually, enable readers to formulate new models and implement them.

Abstract

Before we start thinking about (deep) generative modeling, let us consider a simple example. Imagine we have trained a deep neural network that classifies images (\(\mathbf {x} \in \mathbb {Z}^{D}\)) of animals (\(y \in \mathcal {Y}\), where \(\mathcal {Y} = \{cat, dog, horse\}\)). Further, let us assume that this neural network is trained really well, so that it always assigns the correct class a high probability p(y|x). So far so good, right? A problem can occur, though. As pointed out in [1], adding noise to images can result in a completely wrong classification. An example of such a situation is presented in Fig. 1.1, where adding noise shifts the predicted probabilities of the labels even though the image is barely changed (at least to us, human beings).
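To make this phenomenon concrete, here is a minimal sketch in plain Python: a hypothetical linear "classifier" over a 4-pixel image with three classes, where a small, targeted perturbation flips the predicted class while leaving the input almost unchanged. The weights, input, and noise direction are all toy values chosen for illustration, not a real adversarial attack.

```python
import math

def softmax(logits):
    # numerically stable softmax
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy linear "classifier" over a 4-pixel image, 3 classes: cat, dog, horse.
W = [
    [1.0, 0.0, 0.0, 0.0],  # cat
    [0.0, 1.0, 0.0, 0.0],  # dog
    [0.0, 0.0, 1.0, 0.0],  # horse
]

def predict(x):
    logits = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
    return softmax(logits)

x = [2.0, 1.0, 0.0, 0.0]        # confidently classified as "cat"
noise = [-0.6, 0.6, 0.0, 0.0]   # small, targeted perturbation
x_adv = [xi + ni for xi, ni in zip(x, noise)]

p_clean = predict(x)    # argmax: cat
p_adv = predict(x_adv)  # argmax: dog, although x barely changed
```

The point is that p(y|x) alone says nothing about how plausible x itself is; a model of p(x) could flag the perturbed input as atypical.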

Abstract

Let us imagine cats. Most people like cats, and some people are crazy in love with cats. There are ginger cats, black cats, big cats, small cats, puffy cats, and furless cats. In fact, there are many different kinds of cats. However, when I say the word "a cat," everyone has some kind of a cat in their mind. One can close one's eyes and generate a picture of a cat, either one's own cat or a neighbor's cat. Further, this generated cat is located somewhere, e.g., sleeping on a couch or chasing a fly in a garden, during the night or during the day, and so on. Probably, we can agree at this point that there are infinitely many possible scenarios of cats in some environments.

Abstract

Before we start discussing how we can model the distribution p(x), let us refresh our memory about the core rules of probability theory, namely, the sum rule, \(p(\mathbf {x}) = \sum _{\mathbf {y}} p(\mathbf {x}, \mathbf {y})\), and the product rule, \(p(\mathbf {x}, \mathbf {y}) = p(\mathbf {y} | \mathbf {x}) p(\mathbf {x})\). Let us introduce two random variables x and y.
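The two rules can be verified directly on a small discrete example. The joint distribution below is a hypothetical table over two binary variables, used only to show that marginalizing recovers p(x) (sum rule) and that p(y|x)p(x) reconstructs the joint (product rule).

```python
# A discrete joint distribution p(x, y) over x in {0, 1} and y in {0, 1},
# stored as a table; the probabilities are hypothetical and sum to 1.
p_joint = {
    (0, 0): 0.1, (0, 1): 0.3,
    (1, 0): 0.2, (1, 1): 0.4,
}

def marginal_x(x):
    # sum rule: p(x) = sum_y p(x, y)
    return sum(p for (xi, _), p in p_joint.items() if xi == x)

def conditional_y_given_x(y, x):
    # product rule rearranged: p(y | x) = p(x, y) / p(x)
    return p_joint[(x, y)] / marginal_x(x)

# product rule: p(y | x) * p(x) gives back the joint entry
reconstructed = conditional_y_given_x(1, 0) * marginal_x(0)
```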

Abstract

So far, we have discussed a class of deep generative models that model the distribution p(x) directly in an autoregressive manner. The main advantage of ARMs is that they can learn long-range statistics and, as a consequence, are powerful density estimators. However, their drawback is that, because they are parameterized in an autoregressive manner, sampling is a rather slow, sequential process. Moreover, they lack a latent representation; therefore, it is not obvious how to manipulate their internal data representation, which makes them less appealing for tasks like compression or metric learning. In this chapter, we present a different approach to modeling p(x) directly. However, before we start our considerations, we will discuss a simple example.
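The sequential-sampling drawback is easy to see in code. Below is a toy autoregressive model over binary vectors with a hypothetical conditional rule (the more ones seen so far, the less likely the next one); it is not a trained model, just an illustration of the structure \(p(\mathbf{x}) = \prod_d p(x_d | \mathbf{x}_{<d})\).

```python
import math
import random

# A toy autoregressive model over binary vectors x = (x_1, ..., x_D):
# p(x) = prod_d p(x_d | x_{<d}). The conditional below is a hypothetical
# rule: the more ones we have already sampled, the less likely the next one.
def cond_prob_one(prefix):
    return 1.0 / (2.0 + sum(prefix))

def sample(D, rng):
    # sampling is inherently sequential: x_d requires all of x_{<d}
    x = []
    for _ in range(D):
        x.append(1 if rng.random() < cond_prob_one(x) else 0)
    return x

def log_prob(x):
    # evaluating the log-density, in contrast, is a single pass
    lp = 0.0
    for d, xd in enumerate(x):
        p1 = cond_prob_one(x[:d])
        lp += math.log(p1 if xd == 1 else 1.0 - p1)
    return lp
```

Generating a D-dimensional sample takes D dependent steps, which is exactly why ARM sampling is slow for high-dimensional data such as images.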

Abstract

In the previous chapters, we discussed two approaches to learning p(x): autoregressive models (ARMs) in Chap. 3 and flow-based models (or flows, for short) in Chap. 4. Both ARMs and flows model the likelihood function directly: either by factorizing the distribution and parameterizing the conditional distributions \(p(x_{d} | \mathbf {x}_{<d})\), as in ARMs, or by utilizing invertible transformations (neural networks) in the change-of-variables formula, as in flows. Now, we will discuss a third approach that introduces latent variables.
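As a reminder of the flow recipe, here is a minimal one-dimensional sketch of the change-of-variables formula: if z = f(x) is invertible and z has a known density, then \(p_x(x) = p_z(f(x)) \, |f'(x)|\). The transformation below is a hypothetical affine "flow" with fixed parameters, so the resulting density is simply a Gaussian.

```python
import math

# Change of variables in 1D: z = f(x) invertible, z ~ N(0, 1),
# hence p_x(x) = p_z(f(x)) * |df/dx|. Here f is a toy affine flow
# z = (x - mu) / sigma, so p_x is the N(mu, sigma^2) density.
mu, sigma = 2.0, 0.5

def standard_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def flow_density(x):
    z = (x - mu) / sigma   # the invertible transformation f
    jac = 1.0 / sigma      # |df/dx|, the Jacobian term
    return standard_normal_pdf(z) * jac
```

Real flows stack many such invertible layers with learnable parameters, but the density computation follows this exact pattern: base density at f(x), times the absolute Jacobian determinant.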

Abstract

In Chap. 1, I tried to convince you that learning the conditional distribution p(y|x) is not enough and, instead, we should focus on the joint distribution p(x, y).

Abstract

So far, we have discussed various deep generative models for modeling the marginal distribution over observable variables (e.g., images), p(x), such as autoregressive models (ARMs), flow-based models (flows, for short), variational autoencoders (VAEs), and hierarchical models like hierarchical VAEs and diffusion-based deep generative models (DDGMs). However, from the very beginning, we have advocated for using deep generative modeling in the context of finding the joint distribution over observables and decision variables, factorized as p(x, y) = p(y|x)p(x). After taking the logarithm of the joint, we obtain two additive components: \(\ln p(\mathbf {x}, y) = \ln p(y | \mathbf {x}) + \ln p(\mathbf {x})\). We outlined how such a joint model could be formulated and trained in the hybrid modeling setting (see Chap. 6). The drawback of hybrid modeling, though, is the necessity of weighting the two distributions, i.e., \(\ell (\mathbf {x}, y; \lambda ) = \ln p(y | \mathbf {x}) + \lambda \ln p(\mathbf {x})\), and for λ ≠ 1, this objective does not correspond to the log-likelihood of the joint distribution. The question is whether it is possible to formulate a model that can learn with λ = 1. Here, we are going to discuss a potential solution to this problem using probabilistic energy-based models (EBMs) (LeCun et al., Predicting Structured Data, 2006).
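The role of λ is easy to check numerically. In the sketch below, the two log-probabilities are hypothetical placeholder values standing in for a classifier term and a marginal model term; only λ = 1 recovers the log-likelihood of the joint.

```python
import math

# Hybrid-modeling objective: l(x, y; lambda) = ln p(y|x) + lambda * ln p(x).
# The two log-probabilities are hypothetical stand-ins for real models.
def hybrid_objective(log_p_y_given_x, log_p_x, lam):
    return log_p_y_given_x + lam * log_p_x

log_p_y_given_x = math.log(0.9)   # a confident classifier
log_p_x = math.log(1e-3)          # marginal likelihood of the input

# Only lam = 1.0 corresponds to ln p(x, y) = ln(p(y|x) * p(x)):
joint_ll = hybrid_objective(log_p_y_given_x, log_p_x, lam=1.0)
weighted = hybrid_objective(log_p_y_given_x, log_p_x, lam=0.5)
```

For λ < 1 the marginal term is down-weighted, so the objective is larger than the true joint log-likelihood; it is a trade-off, not a likelihood.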

Abstract

When we discussed latent variable models, we claimed that they naturally define a generative process by first sampling latents z ∼ p(z) and then generating observables x ∼ p_{θ}(x|z). That is nice! However, a problem appears when we start thinking about training. To be more precise, the training objective is an issue. Why? Well, probability theory tells us to get rid of all unobserved random variables by marginalizing them out.
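Marginalizing out z means computing \(p(\mathbf{x}) = \int p_{\theta}(\mathbf{x} | \mathbf{z}) p(\mathbf{z}) \, d\mathbf{z}\), which is intractable in general. A naive Monte Carlo estimate makes the issue concrete; the toy model below (z ∼ N(0, 1), x | z ∼ N(z, 1)) is a hypothetical example chosen because its marginal, N(0, 2), is known in closed form.

```python
import math
import random

# Naive Monte Carlo estimate of the marginal likelihood:
# p(x) = E_{z ~ p(z)}[ p(x | z) ] ~= (1/S) * sum_s p(x | z_s).
# Toy model: z ~ N(0, 1), x | z ~ N(z, 1), hence p(x) = N(x; 0, 2).
def normal_pdf(x, mean, std):
    u = (x - mean) / std
    return math.exp(-0.5 * u * u) / (std * math.sqrt(2.0 * math.pi))

def marginal_likelihood(x, num_samples=100_000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)          # z ~ p(z)
        total += normal_pdf(x, z, 1.0)   # p(x | z)
    return total / num_samples
```

For high-dimensional z, most samples from the prior assign x negligible likelihood, so this estimator becomes hopelessly inefficient; this is precisely what motivates variational inference and the ELBO.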

Abstract

I must say that it is hard to come up with a shorter definition of modern generative modeling. Once we look at various classes of models, we immediately notice that this is exactly what we try to do: generate data from noise! Don't believe me? OK, let us have a look at how various classes of generative models work.

Abstract

In December 2020, Facebook reported having around 1.8 billion daily active users and around 2.8 billion monthly active users (Facebook reports fourth quarter and full year 2020 results, 2020). Assuming that users uploaded, on average, a single photo each day, the resulting volume of data would give a very rough (let me stress it, a very rough) estimate of around 3000 TB of new images per day. This single case of Facebook alone already shows us the potentially great costs associated with storing and transmitting data. In the digital era, we can simply say this: handling data efficiently and effectively (i.e., faster and smaller) means more money in the pocket.

Abstract

How is it possible, my curious reader, that we can share our thoughts? How can it be that we discuss generative modeling, probability theory, or other interesting concepts? How come? The answer is simple: language. We communicate because the human species developed a pretty distinctive trait that allows us to formulate sounds in a very complex manner to express our ideas and experiences. At some point in our history, some people realized that we forget, we lie, and we can shout as loudly as we can, but we will not be understood farther than a few hundred meters away. The solution was a huge breakthrough: writing. This whole mumbling on my side here could be summarized using one word: text. We know how to write (and read), and we can use the word text to mean language or natural language, to avoid any confusion with artificial languages like Python or formal languages.