01122019  Research  Issue 1/2019 Open Access
Cosmological N-body simulations: a challenge for scalable generative models
Electronic Supplementary Material
The online version of this article (https://doi.org/10.1186/s40668-019-0032-1) contains supplementary material.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
1 Introduction
The recent advances in the field of deep learning have initiated a new era for generative models. Generative Adversarial Networks (GANs) (Goodfellow et al.
2014) have become a very popular approach by demonstrating their ability to learn complicated representations and produce high-resolution images (Karras et al.
2018). In the field of cosmology, high-resolution simulations of the matter distribution are becoming increasingly important for deepening our understanding of the evolution of structures in the universe (Springel et al.
2005; Potter et al.
2017; Kuhlen et al.
2012). These simulations are made using the
N-body technique, which represents the distribution of matter in 3D space by trillions of particles. These simulations are very slow to run and computationally expensive, as they evolve the positions of the particles over cosmic time in small time steps. Generative models have been proposed to emulate this type of data, dramatically accelerating the process of obtaining new simulations once training is finished (Rodríguez et al.
2018; Mustafa et al.
2019).
N-body simulations represent the matter in a cosmological volume, typically between 0.1–10 Gpc, as a set of particles, typically between \(100^{3}\) and \(2000^{3}\). The initial 3D positions of the particles are typically drawn from a Gaussian random field with a specific power spectrum. Then, the particles are displaced over time according to the laws of gravity, the properties of dark energy, and other physical effects included in the simulations. During this evolution, the field becomes increasingly non-Gaussian and displays characteristic features, such as halos, filaments, sheets, and voids (Bond et al.
1996; Dietrich et al.
2012).
N-body simulations that consist only of dark matter effectively solve Poisson's equation numerically. This process is computationally expensive, as the forces must be recalculated at short time intervals to retain the precision of the approximation, which leads to frequent updates of the particle positions. The speed of these simulations is a large computational bottleneck for cosmological experiments, such as the Dark Energy Survey,^{1} Euclid,^{2} or LSST.^{3}
Recently, GANs have been proposed for emulating the matter distributions in two dimensions (Rodríguez et al.
2018; Mustafa et al.
2019). These approaches have been successful in generating data of high visual quality, almost indistinguishable from the real simulations even to experts. Moreover, several summary statistics often used in cosmology, such as power spectra and density histograms, also revealed good levels of performance. Some challenges still remain when comparing sets of generated samples. In both works, the properties of sets of generated images did not match exactly; the covariance matrices of the power spectra of the generated maps differed from those of the real maps by the order of 10%.
While these results are encouraging, a significant difficulty remains in scaling these models to generate three-dimensional data, which involve several orders of magnitude more pixels for a single data instance. We address this problem in this work. We present a publicly available dataset of N-body cubes, consisting of 30 independent instances. Because the dark matter distribution is homogeneous and isotropic, and the simulations use periodic boundary conditions, the data can easily be augmented through shifts, rotations, and flips. The data is in the form of a list of particles with spatial positions \(x\), \(y\), \(z\). It can be pixelised into 3D histogram cubes, where the matter distribution is represented in density voxels. Each voxel contains the count of particles falling into it. If the resolution of the voxel cube is high enough, the particle-based and voxel-based representations should be usable interchangeably for most applications. Approaches to generate the matter distribution in the particle-based representation could also be designed; in this work, however, we focus on the voxel-based representation. By publishing the N-body data and the accompanying code we aim to encourage the development of large-scale generative models capable of handling such data volumes.
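The pixelisation described above can be sketched with `numpy.histogramdd`; the `voxelize` helper, box size, and voxel counts below are illustrative assumptions, not part of the released code:

```python
import numpy as np

def voxelize(positions, box_size, n_voxels):
    """Bin an (N, 3) array of particle positions into an n_voxels^3 count
    cube; a hypothetical helper, not part of the released code."""
    edges = np.linspace(0.0, box_size, n_voxels + 1)
    cube, _ = np.histogramdd(positions, bins=(edges, edges, edges))
    return cube

# Toy example: 1000 uniform particles in a periodic box, 8^3 voxels.
rng = np.random.default_rng(0)
positions = rng.uniform(0.0, 100.0, size=(1000, 3))
cube = voxelize(positions, box_size=100.0, n_voxels=8)
```

Each voxel of `cube` then holds the particle count falling into it, and the total count equals the number of particles.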
We present a benchmark GAN system to generate 3D N-body voxel cubes. Our novel GAN architecture scales to volumes of \(256^{3}\) voxels. Our proposed solution relies on two key building blocks. First, we split the generation of the high-dimensional data into smaller patches. Instead of assuming that the distribution of each patch is independent of the surrounding context, we model it as a function of the neighboring patches. Although splitting the generation process into patches provides a scalable way to generate images of arbitrary size, it also limits the field of view of the generator, reducing its ability to learn global image features. The second core idea of our method addresses this problem by relying on a multiscale approach that efficiently captures global dependencies that might otherwise be lost in the splitting process.
Our results constitute a baseline solution to the challenge. While the obtained statistical accuracy is currently insufficient for a real cosmological use case, we achieve two goals: (i) we demonstrate that the project is tractable by GAN architectures, and (ii) we provide a framework for evaluating the performance of new algorithms in the future.
1.1 Related work
Generative models that produce novel representative samples from high-dimensional data distributions are becoming increasingly popular in various fields such as image-to-image translation (Zhu et al.
2017) or image inpainting (Iizuka, Simo-Serra and Ishikawa
2017), to name a few. There are many different deep learning approaches to generative models. The most popular ones are Variational Auto-Encoders (VAE) (Kingma and Welling
2014), Autoregressive models such as PixelCNN (van den Oord et al.
2016), and Generative Adversarial Networks (GAN) (Goodfellow et al.
2014). Regarding prior work for generating 3D images or volumes, two main types of architectures – in particular GANs – have been proposed. The first type (Achlioptas et al.
2018; Fan et al.
2017) generates 3D point clouds with a 1D convolutional architecture by producing a list of 3D point positions. This type of model does not scale to cases where billions of points are present in a simulation, posing an important concern given the size of current and future
N-body simulations. The second type of approach, including Wu et al. (
2016), Mosser et al. (
2017), directly uses 3D convolutions to produce a volume. Although the computation and memory cost is independent of the number of particles, it scales with the number of voxels of the desired volume, which grows cubically with the resolution. While recursive models such as PixelCNN (van den Oord et al.
2016) can scale to some extent, they are slow to generate samples, as they build the output image pixel by pixel in a sequential manner. We take inspiration from PixelCNN to design a patch-by-patch, rather than a pixel-by-pixel, approach, which significantly speeds up the generation of new samples.
As mentioned above, splitting the generation process into patches reduces the ability of the generator to learn global image features. Some partial solutions to this problem can already be found in the literature, such as the Laplacian pyramid GAN (Denton et al.
2015) that provides a mechanism to learn at different scales for high-quality sample generation, but this approach is not scalable as the sample image size is still limited. Similar techniques are used in the problem of super-resolution (Ledig et al.
2017; Lai et al.
2017; Wang et al.
2018). Recently, progressive growing of GANs (Karras et al.
2018) has been proposed to improve the quality of the generated samples and stabilize the training of GANs. The size of the samples produced by the generator is progressively increased by adding layers at the end of the generator and at the beginning of the discriminator. In the same direction, Brock et al. (
2019), Lučić et al. (
2019) achieved impressive quality in the generation of large images by leveraging better optimization. However, the hardware limits of the machine on which the model is trained are reached after a certain increase in sample size, and all of these approaches eventually fail to offer the scalability we are after.
GANs were proposed for generating matter distributions in 2D. A generative model for the projected matter distribution, also called a
mass map, was introduced by Mustafa et al. (
2019). Mass maps are cosmological observables, as they are reconstructed by techniques such as gravitational lensing (Chang et al.
2018). Mass maps arise through integration of the matter density over the radial dimension with a specific, distancedependent kernel. The generative model presented in Mustafa et al. (
2019) achieved very good agreement with the real data on several important non-Gaussian summary statistics: power spectra, density histograms, and Minkowski functionals (Schmalzing et al.
1996). The distributions of these summaries between sets of generated and real data also agreed well. However, the covariance matrix of power spectra within the generated and real sets did not match perfectly, differing by the order of 10%.
A generative model working on 2D slices from
N-body simulations was developed by Rodríguez et al. (
2018).
N-body slices have much more complex features, such as filaments and sheets, as they are not averaged out in projection. Moreover, the dynamic range of pixel values spans several orders of magnitude. GANs presented by Rodríguez et al. (
2018) also achieved good performance, but only for larger cosmological volumes of 500 Mpc. Some mismatch in the power spectrum covariance was also observed.
Alternative approaches to emulating cosmological matter distributions using deep learning have recently been proposed. The Deep Displacement Model (He et al.
2018) uses a U-shaped neural network that learns how to modify the positions of the particles from the initial conditions to a given time in the history of the universe.
Generative models have also been proposed for solving other problems in cosmology, such as generation of galaxies (Regier et al.
2015), adding baryonic effects to the dark matter distribution (Tröster et al.
2019), recovery of certain features from noisy astrophysical images (Schawinski et al.
2017), deblending galaxy superpositions (Reiman and Göhre
2019), and improving the resolution of matter distributions (Kodi Ramanah et al.
2019).
2 The N-body data
2.1 Cosmological N-body simulations
The distribution of matter, dark matter, and other particles in the universe at large scales, under the influence of gravity, forms a convoluted network-like structure called the cosmic web (Bond et al.
1996; Coles and Chiang
2000; Forero-Romero et al.
2009; Dietrich et al.
2012). This distribution contains information vital to the study of dark matter, dark energy, and the very laws of gravity (Abbott et al.
2018; Hildebrandt et al.
2017; Joudaki et al.
2017). Simulations of these various computational cosmological models (Springel
2005; Potter et al.
2017) lead to an understanding of the fundamentals of cosmological measurements (Fosalba et al.
2015; Busha et al.
2013), and other properties of the universe (Springel et al.
2005). These simulations are done using N-body techniques.
N-body techniques simulate the cosmic web using a set of particles in three-dimensional space, and evolve their positions with time. This evolution is governed by the underlying cosmological model and the laws of gravity. The end result of an N-body simulation is the position of billions of particles in space, as depicted in Fig.
1. Unfortunately, N-body simulations are extremely resource intensive, requiring days or even weeks of computation (Teyssier et al.
2009; Boylan-Kolchin et al.
2009). Moreover, a large number of these N-body simulations is needed to obtain good statistical accuracy, which further increases the computational requirements.
This computational bottleneck opens an opportunity for deep learning and generative models to offer an alternative solution to the problem. Generative models have the potential to learn the underlying data distribution of the N-body simulations using a seed set of N-body samples to train on.
There are multiple techniques for running
N-body simulations, which agree well at large scales, but start to diverge at small scales, around wavenumber
\(k=1~\mbox{Mpc}^{-1}\) (Schneider et al.
2016). Moreover, baryonic feedback can also affect the small scale matter distribution (Mead et al.
2015; Huang et al.
2019; Barreira et al.
2019), and large uncertainty remains for these scales.
2.2 Data preprocessing
We produce samples of the cosmic web using standard
N-body simulation techniques. We used L-PICOLA (Howlett et al.
2015) to create 30 independent simulations. The cosmological model used was
ΛCDM with Hubble constant
\(\mathrm{H}_{0} = 500h = 350~\mbox{km/s/Mpc}\),
^{4} dark energy density
\(\varOmega _{\varLambda }= 0.724\) and matter density
\(\varOmega _{m} = 0.276\). We used the particle distribution at redshift
\(z = 0\). The output of the simulator is a list of particles 1024
^{3} 3D positions. To obtain the matter distribution, we first convert it to a standard 256
^{3} 3D cube using histogram binning. We consider these cubes as the raw data for our challenge and can be downloaded at
https://zenodo.org/record/1464832. The goal is to build a generative model able to produce new 3D histograms. While 30 samples might seem as a low number of samples to train a deep neural network, each sample contains a large number of voxels. One can also expand the training data by relying on data augmentation, using various rotations and circular translations as described in Appendix
2. Problematically, the dynamic range of this raw data spans several orders of magnitude and the distribution is skewed towards smaller values, with a very elongated tail towards the larger values. Empirically, we find that this very skewed distribution makes learning a generative model difficult. Therefore, we first transform the data using a logarithmbased function, as described in Appendix
1. Note that this transform needs to be inverted before the evaluation procedure.
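The augmentation and transform steps described above might look as follows. `augment`, `forward`, and `backward` are hypothetical helpers, and `log1p` is only a stand-in for the exact logarithm-based function of Appendix 1:

```python
import numpy as np

def augment(cube, rng):
    """Random periodic shift, 90-degree rotation, and flip of a 3D cube.
    Valid because the simulated box is periodic, homogeneous, and isotropic."""
    shift = rng.integers(0, cube.shape[0], size=3)
    cube = np.roll(cube, shift, axis=(0, 1, 2))      # circular translation
    cube = np.rot90(cube, k=rng.integers(0, 4), axes=(0, 1))
    if rng.integers(0, 2):
        cube = np.flip(cube, axis=2)
    return cube

def forward(x):
    """Stand-in (assumed log1p) for the log-based transform of Appendix 1."""
    return np.log1p(x)

def backward(y):
    """Inverse transform, to be applied before the evaluation procedure."""
    return np.expm1(y)

rng = np.random.default_rng(1)
cube = rng.poisson(2.0, size=(16, 16, 16)).astype(float)
aug = augment(cube, rng)
```

All three augmentations preserve the total particle count, so the mass histogram of an augmented cube is unchanged.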
2.3 Evaluation procedure
The evaluation of a GAN is not a simple task (Borji
2019; Grnarova et al.
2019). Fortunately, following Rodríguez et al. (
2018), we can evaluate the quality of the generated samples with three summary statistics commonly used in the field of cosmology.
1.
Mass histogram is the average (normalized) histogram of the pixel values in the image. Note that the pixel value is proportional to the matter density.
2.
Power spectrum density (or PSD) is the amplitude of the Fourier modes as a function of their frequency (the phase is ignored). In practice, a set of bins is defined and the modes of similar frequency are averaged together.
3.
Peak histogram is the distribution of maxima in the density distribution, often called “peak statistics”, or “mass function”. Peaks capture the non-Gaussian features present in the cosmic web. This statistic is commonly used on weak lensing data (Kacprzak et al.
2016; Martinet et al.
2017). A peak is defined as a pixel greater than every pixel in its 2-pixel neighbourhood (24 pixels in 2D and 124 in 3D).
Other statistics, such as Minkowski functionals, three-point correlation functions, or phase distributions, could be considered. Nevertheless, we find that the three aforementioned statistics are currently sufficient to compare different generative models.
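A minimal numpy/scipy sketch of the three statistics; the function names and binning choices are ours and may differ from the released evaluation code:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def mass_histogram(cube, bins=20):
    """Normalized histogram of the voxel values (value ~ matter density)."""
    hist, _ = np.histogram(cube, bins=bins, density=True)
    return hist

def power_spectrum(cube, bins=16):
    """Radially averaged |FFT|^2: modes of similar |k| are binned together."""
    power = np.abs(np.fft.fftn(cube)) ** 2
    freq = np.fft.fftfreq(cube.shape[0])
    kx, ky, kz = np.meshgrid(freq, freq, freq, indexing="ij")
    k = np.sqrt(kx**2 + ky**2 + kz**2).ravel()
    edges = np.linspace(0.0, k.max(), bins + 1)
    which = np.digitize(k, edges) - 1
    return np.array([power.ravel()[which == i].mean() for i in range(bins)])

def peak_values(cube):
    """Voxels strictly greater than all 124 neighbors in their 5x5x5
    neighborhood (the 3D peak definition above), with periodic wrapping."""
    footprint = np.ones((5, 5, 5), dtype=bool)
    footprint[2, 2, 2] = False          # exclude the center voxel itself
    neighbor_max = maximum_filter(cube, footprint=footprint, mode="wrap")
    return cube[cube > neighbor_max]

# Toy cube with a single sharp peak.
demo = np.zeros((16, 16, 16))
demo[8, 8, 8] = 5.0
peaks = peak_values(demo)
```

The peak histogram is then simply a histogram of `peak_values(cube)` over a set of density bins.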
Distance between statistics
We define a score that reflects the agreement of the three aforementioned cosmological measures. Problematically, the scalars forming the three vectors representing them have very different scales, and the metric should represent the relative error instead of the absolute one. For this reason, we first compute the logarithm (in base 10) of the computed statistic
s. As all statistics are non-negative but not strictly positive, we add a small value
ϵ before computing the logarithm, i.e.,
\(s_{\log }= \log _{10}(s+\epsilon )\). Here
ϵ is set to the maximum value of the statistic averaged over all real samples, divided by \(10^{5}\).
At this point, the difference of the real and fake
\(s_{\log }\) relates to the relative difference, i.e.:
\(s_{\log }^{r}-s_{\log }^{f}\approx \log _{10}\frac{s^{r}}{s^{f}}\). One could quantify the error using a norm, i.e.:
\(\Vert \mathbb {E}[s_{\log }^{r}]-\mathbb {E}[s_{\log }^{f}]\Vert \). However, such a distance does not take into account second-order moments. Rather, we take inspiration from the Fréchet Inception Distance (Heusel et al.
2017). We start by modeling the real and fake log statistics
\(s_{\log }^{r}\),
\(s_{\log }^{f}\) as two multivariate Gaussian distributions. This allows us to compute the Fréchet Distance (FD) between the two distributions (Fréchet
1957), which is also the Wasserstein-2 distance. The FD between two Gaussian distributions with means and covariances
\((m^{r},C^{r})\) and
\((m^{f}, C^{f})\) is given by Dowson and Landau (
1982):
$$ \begin{aligned}[b] &d^{2} \bigl(\bigl(m^{r},C^{r} \bigr),\bigl(m^{f}, C^{f}\bigr) \bigr) \\ &\quad = \bigl\Vert m^{r} - m^{f} \bigr\Vert _{2}^{2} + \operatorname{Tr} \bigl( C^{r} + C^{f} - 2 \bigl( C^{r} C^{f} \bigr)^{1/2} \bigr). \end{aligned} $$
(1)
Note that Dowson and Landau (
1982) also prove that
\(\operatorname{Tr} ( C^{r} + C^{f} - 2 ( C^{r} C^{f} )^{1/2} )\) defines a metric on the space of covariance matrices. We choose the FD over the Kullback–Leibler (KL) divergence for two reasons: (a) the Wasserstein distance remains an appropriate distance when distributions have non-overlapping support, and (b) the KL is computationally more unstable since the covariance matrices need to be inverted.
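The Fréchet Distance between two Gaussians can be evaluated directly with `scipy.linalg.sqrtm`; this is a generic sketch of Eq. (1), not the released evaluation code:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(m_r, C_r, m_f, C_f):
    """Squared Frechet Distance between two Gaussians, as in Eq. (1)."""
    covmean = sqrtm(C_r @ C_f).real   # sqrtm may return tiny imaginary parts
    return float(np.sum((m_r - m_f) ** 2)
                 + np.trace(C_r + C_f - 2.0 * covmean))

m = np.array([1.0, 2.0])
C = np.array([[2.0, 0.3], [0.3, 1.0]])
d2_same = frechet_distance(m, C, m, C)          # identical Gaussians
d2_shift = frechet_distance(m, C, m + 1.0, C)   # mean shifted by (1, 1)
```

For identical Gaussians the distance vanishes; shifting the mean by a unit vector in each of the two dimensions contributes exactly 2 through the first term.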
2.4 N-body mass map generation challenge
Using the FD, we define a
score for each statistic as
$$ S^{*} = \frac{1}{d^{2}((m^{r},C^{r}),(m^{f}, C^{f}))}, $$
(2)
where
m,
C are computed on the log statistics, i.e.:
\(m^{r}=\mathbb {E}[s_{\log }^{r}]\),
\(m^{f}=\mathbb {E}[s_{\log }^{f}]\). Along with this manuscript, we release our dataset and our evaluation procedure in the hope that further contributions will improve our solution. All information can be found at
https://github.com/nperraud/3DcosmoGAN. We hope that this dataset and evaluation procedure can be a tool for the evaluation of GANs in general. In practice, these cosmological statistics are very sensitive. We observed two important properties of this problem. First, even a small variation in the generated images has an impact on the statistics. The statistics can be highly affected by high-density regions of the
N-body data, and these regions are also the rarest in the training set. Second, while mode collapse may not directly affect the mean of the statistics, it can affect their second-order moment significantly. We observed that obtaining good statistical agreement (and hence a high score) is much more difficult than obtaining generated images that are indistinguishable to the human eye, especially in the two-dimensional case. We found that the problems of high data volume, large dynamic range of the data, and the strict requirement of good agreement in statistical properties of real and generated samples pose a significant challenge for generative models.
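A simplified sketch of the score of Eq. (2); for brevity it uses only the diagonal of the covariance matrices, whereas the actual evaluation uses the full Fréchet Distance:

```python
import numpy as np

def score(stats_real, stats_fake):
    """S* = 1 / d^2 on the log10 statistics (Eq. (2)). Diagonal-covariance
    sketch: only per-bin variances enter the trace term, so the matrix
    square root of the full Frechet Distance is not needed."""
    eps = stats_real.mean(axis=0).max() / 1e5   # epsilon floor from the text
    lr = np.log10(stats_real + eps)
    lf = np.log10(stats_fake + eps)
    m_r, m_f = lr.mean(axis=0), lf.mean(axis=0)
    v_r, v_f = lr.var(axis=0), lf.var(axis=0)
    d2 = np.sum((m_r - m_f) ** 2) + np.sum(v_r + v_f - 2.0 * np.sqrt(v_r * v_f))
    return 1.0 / d2

# Toy "statistics": rows are samples, columns are histogram/PSD bins.
real = np.random.default_rng(0).lognormal(0.0, 1.0, size=(100, 5))
fake = np.random.default_rng(1).lognormal(0.5, 1.0, size=(100, 5))
s_star = score(real, fake)
```

A nearly matching fake set yields a much higher score than a set with shifted statistics, as intended.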
Interpretation of the
\(S^{*}\) scores
We report the
\(S^{*}\) scores for the 3D case in Sect.
4 and the web page hosting the source code also includes baseline scores for the 2D case. We would like to emphasize that these scores are mostly suitable to compare two generative models applied to the
same distribution. It is a priori unclear whether these scores provide a meaningful comparison when considering two generative models trained on different distributions. We therefore refrain from relying on such scores to compare the generative models trained on 2D and 3D data since the latter gives access to correlations along the third dimension that are absent from the 2D data. Finally, while in theory, the
\(S^{*}\) score is unbounded and can be arbitrarily large as
\(d^{2}\rightarrow 0\), its empirical evaluation is, in practice, limited by the estimation error of
\(d^{2}\) that depends on the number of samples used to estimate the mean vectors and the covariance matrices
\((m^{r},C^{r})\) as well as the moments of the estimated statistics. In the case where the training set contains a large number of samples, one would expect the training score to be indicative of the score of the true distribution. One can see that the estimation error of the first term of (
1) depends mostly on the variance of the statistic, while for the second term, it depends on the moments of order 3 and 4. Hence, the estimated score will reflect the variance of the estimated statistics given the number of samples used. A high score therefore means less variance within the corresponding statistic.
3 Sequential generative approach
We propose a novel approach to efficiently learn a Generative Adversarial Network model (see Sect.
3.1) for 2D and 3D images of arbitrary size. Our method relies on two building blocks: (1) a multiscale model that improves the quality of the generated samples, both visually and quantitatively, by learning unique features at different scales (see Sect.
3.2), and (2) a training strategy that enables learning images of arbitrary size, that we call “conditioning on neighborhood” (see Sect.
3.3).
3.1 Generative adversarial networks (GANs)
Generative Adversarial Networks (GANs) rely on two competing neural networks that are trained simultaneously: the generator G, which produces new samples, and the discriminator D, which attempts to distinguish them from the real ones. During training, the generator's objective is to fool the discriminator, while the discriminator resists by learning to accurately discriminate real and fake data. Eventually, if the optimization process is carried out successfully, the generator should improve to the point that its samples become indistinguishable from the real ones. In practice, this optimization process is challenging, and numerous variants of the original GAN approach have been proposed, many of them aiming to improve stability, e.g., Roth et al. (
2017), Gulrajani et al. (
2017), Arjovsky et al. (
2017). In our work, we rely on the improved Wasserstein GAN (WGAN) approach introduced in Gulrajani et al. (
2017). The latter optimizes the Wasserstein distance instead of the Jensen-Shannon divergence and penalizes the norm of the gradient of the critic instead of using a hard clipping as in the original WGAN (Arjovsky et al.
2017). The resulting objective function is
$$ L = \mathop{\mathbb{E}}_{\tilde{\mathrm{x}} \sim \mathbb{P}_{g}} \bigl[ D(\tilde{\mathrm{x}}) \bigr] - \mathop{\mathbb{E}}_{\mathrm{x} \sim \mathbb{P}_{r}} \bigl[ D(\mathrm{x}) \bigr] + \lambda \mathop{\mathbb{E}}_{\hat{\mathrm{x}} \sim \mathbb{P}_{\hat{\mathrm{x}}}} \bigl[ \bigl( \bigl\Vert \nabla _{\hat{\mathrm{x}}} D(\hat{\mathrm{x}}) \bigr\Vert _{2} - 1 \bigr)^{2} \bigr], $$
where
\(\mathbb{P}_{r}\) is the data distribution and
\(\mathbb{P}_{g}\) is the generator distribution implicitly defined by
\(\tilde{\mathrm{x}} = G(\mathrm{z})\),
\(\mathrm{z} \sim p(\mathrm{z})\). The latent variable z is sampled from a prior distribution
p, typically a uniform or a Gaussian distribution. Finally,
\(\mathbb{P}_{\hat{\mathrm{x}}}\) is defined implicitly by sampling uniformly along straight lines between pairs of points sampled from the true data distribution
\(\mathbb{P}_{r}\) and the generator distribution
\(\mathbb{P}_{g}\). The weight
λ is the penalty coefficient.
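The structure of this objective can be illustrated with a deliberately trivial linear critic, whose input gradient is simply its weight vector; this is an arithmetic sketch of the loss, not a training implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8)                      # linear critic D(x) = w @ x
real = rng.normal(1.0, 1.0, size=(64, 8))   # samples from P_r
fake = rng.normal(0.0, 1.0, size=(64, 8))   # samples from P_g

def critic(x):
    return x @ w

# x_hat: uniform interpolates between real/fake pairs (defines P_xhat).
eps = rng.uniform(size=(64, 1))
x_hat = eps * real + (1.0 - eps) * fake

# For a linear critic, grad_x D(x) = w everywhere, so the gradient
# penalty is (||w||_2 - 1)^2 for every interpolate.
lam = 10.0
penalty = lam * (np.linalg.norm(w) - 1.0) ** 2
loss = critic(fake).mean() - critic(real).mean() + penalty
```

In a real WGAN-GP, the gradient with respect to `x_hat` is obtained by automatic differentiation through a deep critic rather than in closed form.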
3.2 Multiscale model
Our multiscale approach is inspired by the Laplacian pyramid GAN (Denton et al.
2015). We refer to three image types of different sizes, namely
\(I_{3} = 32 \times 32 \times 32\),
\(I_{2} = 64 \times 64 \times 64\),
\(I_{1} = 256 \times 256 \times 256\)^{5} pixels, where
\(I_{2}\) is a downscaled version of
\(I_{1}\) and
\(I_{3}\) is a downscaled version of
\(I_{2}\). The multiscale approach is shown in Fig.
2 and uses three different GANs,
\(M_{1}\),
\(M_{2}\) and
\(M_{3}\), all trained independently of each other; they can therefore be trained in parallel. We train GAN
\(M_{3}\) to learn the data distribution of images
\(I_{3}\), while the GANs
\(M_{2}\) and
\(M_{1}\) are conditioned on the images produced by
\(M_{3}\) and
\(M_{2}\), respectively. In our implementation, we take
\(M_{3}\) to be a standard Deep Convolutional GAN (DCGAN) that learns to produce downscaled samples of size
\(I_{3}\). We design GANs
\(M_{2}\) and
\(M_{1}\) to have the following properties: (1) they produce outputs using a sequential patch-by-patch approach, and (2) the output of each patch is conditioned on the neighboring patches. This procedure allows handling of high data volumes, while preserving the long-range features in the data. Moreover, the different GANs learn salient features at different scales, which contributes to an overall improved quality of the samples produced by the final GAN
\(M_{1}\). Further implementation details are provided in Appendix
3.
3.3 Conditioning on neighborhoods
The limited amount of memory available to train a GAN generator makes it impractical to directly produce large image samples. Using current GPUs with 16 GB of RAM and a state-of-the-art network architecture, the maximum sample size we could use was \(32^{3}\), which is far from our target of \(256^{3}\). In order to circumvent this limitation, we propose a new approach that produces the full image (of size \(256^{3}\)) patch by patch, each patch being of smaller size (\(32^{3}\) in our case). This approach is reminiscent of PixelCNN (van den Oord et al.
2016), where 2D images are generated pixel by pixel, rather than the entire picture being generated at once. Instead of assuming that the distribution of each patch is independent of the surrounding context, we model it as a function of the neighboring patches. The generation process follows a raster-scan order, which implies that a patch depends on the neighboring patches produced before it. The process illustrated in Fig.
3 is for the 2D case with three neighboring patches; the generalization to three dimensions is straightforward, as it simply requires seven neighboring 3D patches.
In the generator, the neighborhood information, i.e. the borders (3 patches in 2D, 7 cubes in 3D), is first encoded using several convolutional layers. It is then concatenated with the latent variable and input to a fully connected layer, before being reshaped into a 3D image. At this point, the downsampled version of the image is concatenated. After a few extra convolutional layers, the generator produces the lower rightmost patch, with the aim of making it indistinguishable from the corresponding patch of the real data. The generator architecture is detailed in Fig.
4. In the case where no neighborhood information is available, such as at the boundary of a cube, we pad with zeros. The discriminator is given images containing four patches, where the lower right patch is either real data or fake data produced by the generator. The generator only produces patches of size \(32^{3}\), irrespective of the size of the original image. In this way, the method can scale to any image size, which is a great advantage. The discriminator only has access to a limited part of the image and ignores long-range dependencies. This issue, however, is handled by the multiscale approach described in the previous section. We also tried a model conditioned only on the neighborhoods, as detailed in Appendix
4.3. It performed significantly worse than the multiscale approach.
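The raster-scan generation with zero padding at the cube boundary can be sketched as follows; `toy_generator` is a stand-in for the trained generator, not the released model:

```python
import numpy as np

PATCH = 32  # patch size used in the paper

def toy_generator(borders, z):
    """Stand-in for the trained generator: combines the latent patch z with
    the 7 previously generated neighbor patches (zero-padded at borders)."""
    return z + 0.1 * borders.mean()

def generate_cube(n_patches=2, seed=0):
    """Raster-scan generation of an (n_patches*PATCH)^3 cube, patch by patch."""
    rng = np.random.default_rng(seed)
    n = n_patches * PATCH
    cube = np.zeros((n, n, n))
    offsets = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0),
               (1, 0, 1), (0, 1, 1), (1, 1, 1)]
    for i in range(n_patches):
        for j in range(n_patches):
            for k in range(n_patches):
                borders = []
                for di, dj, dk in offsets:
                    a, b, c = i - di, j - dj, k - dk
                    if min(a, b, c) < 0:
                        borders.append(np.zeros((PATCH,) * 3))  # zero padding
                    else:
                        borders.append(cube[a*PATCH:(a+1)*PATCH,
                                            b*PATCH:(b+1)*PATCH,
                                            c*PATCH:(c+1)*PATCH])
                z = rng.normal(size=(PATCH,) * 3)
                cube[i*PATCH:(i+1)*PATCH,
                     j*PATCH:(j+1)*PATCH,
                     k*PATCH:(k+1)*PATCH] = toy_generator(np.stack(borders), z)
    return cube

cube = generate_cube()  # 64^3, generated as 8 patches of 32^3
```

Because each patch only depends on previously generated neighbors, the loop can produce cubes of any multiple of the patch size.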
4 Experimental results
Our method relies on a recursive approach that progressively builds a 3D cube, starting from a low resolution and upsampling it at every step. We detail and analyze each step separately in order to understand its impact on the final performance of the full model. We also compare the results of our multiscale approach to a simpler uniscale model. Additional details regarding the network architectures and various implementation choices are available in Appendix
3.
4.1 Scale by scale analysis of the pipeline
In the following, we describe our model that relies on three different GANs, namely
\(M_{1}\),
\(M_{2}\) and
\(M_{3}\), to generate samples at distinct resolutions. We detail each step of the generation process below.
Step 1: lowscale generation (latent code to
\(I_{3}\))
The low-scale generation of a sample of size \(32^{3}\) is performed by the WGAN
\(M_{3}\). The architecture of both the generator and the discriminator is composed of five 3D convolutional layers with kernels of size
\({4 \times 4 \times 4}\). We use leaky ReLU nonlinearities. Further details can be found in Table
2 of Appendix
3.
Figure
5 shows the middle slice from 16 3D samples
\(I_{3}\) drawn from the generator of
\(M_{3}\) compared to real samples. In Fig.
6, we additionally plot our evaluation statistics for the 30 samples corresponding to the total number of
N-body simulations used to build the dataset. The generated samples drawn from
\(M_{3}\) are generally similar to the true data, both from a visual and from a statistical point of view, although one can observe slight disagreements at higher frequencies. Note that the low number of samples and the size of each of them does not allow us to compute very accurate statistics, hence limiting our evaluation.
Step 2: upscale (
\(I_{3}\) to
\(I_{2}\))
Upscaling the sample size from \(32^{3}\) to \(64^{3}\) is performed by the WGAN
\(M_{2}\). The architecture is similar to
\(M_{3}\) (see Table
3 in Appendix
3), except that the border patches are first encoded using three 3D convolutional layers and then concatenated with the latent variables before being fed to the generator.
In order to visualize the quality of upscaling achieved by this first upsampling step independently from the rest of the pipeline, we first downscale the real \(256^{3}\) samples to \(32^{3}\), and then provide them as input
\(I_{3}\) to the WGAN
\(M_{2}\). We then observe the result of the upscaling to \(64^{3}\). Figure
7 shows slices from some generated
\(I_{2}\) samples, as well as the real downscaled
\(I_{3}\) image and the real
\(I_{2}\) image. We observe a clear resemblance between the upscaled fake samples and the real samples. The statistics for this step of the generation process are shown in Fig.
8. We observe more significant discrepancies for features that rarely occur in the training set, such as large peaks. This is, however, not surprising, as learning from few examples is in general a difficult problem.
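The downscaling used to produce the conditioning input can be done with a simple block average; whether the released code uses average- or sum-pooling of the counts is an implementation detail we do not assume here:

```python
import numpy as np

def downscale(cube, factor):
    """Block-average a cubic array by an integer factor, e.g. 256^3 -> 64^3
    (a sum-pool over the counts would be the analogous alternative)."""
    n = cube.shape[0] // factor
    return cube.reshape(n, factor, n, factor, n, factor).mean(axis=(1, 3, 5))

x = np.arange(8 ** 3, dtype=float).reshape(8, 8, 8)
y = downscale(x, 2)
```

Block averaging preserves the mean density of the cube, so the downscaled conditioning input remains physically consistent with the full-resolution sample.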
Step 3: upscale (
\(I_{2}\) to
\(I_{1}\))
The final upscaling from \(64^{3}\) to \(256^{3}\) is performed by the WGAN
\(M_{1}\). The architecture of both the generator and the discriminator of
\(M_{1}\) is composed of eight 3D convolutional layers with inception modules (Szegedy et al.
2014). The inception modules have filters of three different sizes:
\(1 \times 1 \times 1\),
\(2 \times 2 \times 2\) and
\(4 \times 4 \times 4\). The input tensor is convolved with all three types of filters, using padding to keep the output shape identical to the input shape. Finally, the three outputs are summed to recover the desired number of output channels.
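A single-channel sketch of such an inception module, using `scipy.ndimage.convolve` with 'same' padding in place of learned multi-channel convolutions:

```python
import numpy as np
from scipy.ndimage import convolve

def inception_block(x, rng):
    """Single-channel sketch: convolve the input with 1^3, 2^3 and 4^3
    kernels ('same' padding via mode='constant') and sum the outputs."""
    out = np.zeros_like(x)
    for size in (1, 2, 4):
        kernel = rng.normal(size=(size,) * 3)  # stand-in for learned weights
        out += convolve(x, kernel, mode="constant")
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 16, 16))
y = inception_block(x, rng)
```

Because every branch preserves the input shape, the outputs can be summed directly, exactly as described for the module above.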
To visualize the quality of upscaling achieved by the final upsampling step, we downscale the real \(256^{3}\) samples to \(64^{3}\), and then provide them as inputs
\(I_{2}\) to the WGAN
\(M_{1}\). Figure
9 shows the middle slices of two real and fake samples
\(I_{1}\). Although the upsampling factor is 4, the WGAN
\(M_{1}\) is able to produce convincing samples even in terms of high frequency components.
4.2 Full multiscale pipeline
Sample generation
The generation process used to produce new samples proceeds as follows. First, a latent variable is randomly drawn from the prior distribution and is fed to
\(M_{3}\) which produces a 32
^{3} low-resolution cube. The latter is then upsampled by
\(M_{2}\). At first, all border patches shown in Fig.
3 are set to zero. Then the 64
^{3} cube is built recursively (in
\(2^{3}=8\) steps), where at each step the previously generated patches are re-fed as borders into the generator. The generation of the full 256
^{3} cube is done in a similar fashion by
\(M_{1}\). Note that this last step requires the generation of
\(8^{3}=512\) patches/smaller cubes. An illustration, adapted to the 2D case for simplicity, is shown in Fig.
10. The full generation process takes around 7 seconds to produce one sample of size 256
^{3} using a single GPU node with 24 cores compared to approximately 68 hours for a fast and approximate
L-Picola (Howlett et al.
2015) simulator running on two identical nodes.
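The recursive assembly described above can be sketched as follows, with `generate_patch` standing in for the conditional WGAN generator (the name and interface are assumptions for illustration). Patches are produced in raster order, so each call sees its up-to-7 already-generated neighbors, with zeros at the cube faces:

```python
import numpy as np

def assemble(n_per_side, p, generate_patch):
    """Assemble a (n_per_side*p)^3 cube from p^3 patches in raster order.

    generate_patch(borders) -> (p,p,p) array; `borders` is the list of the
    7 neighbouring patches already generated (zeros at the cube faces)."""
    N = n_per_side * p
    cube = np.zeros((N, N, N))
    for i in range(n_per_side):
        for j in range(n_per_side):
            for k in range(n_per_side):
                def nb(di, dj, dk):
                    a, b, c = i + di, j + dj, k + dk
                    if min(a, b, c) < 0:          # outside the cube: zeros
                        return np.zeros((p, p, p))
                    return cube[a*p:(a+1)*p, b*p:(b+1)*p, c*p:(c+1)*p]
                # the 7 neighbours that precede (i,j,k) in raster order
                borders = [nb(di, dj, dk)
                           for di in (-1, 0) for dj in (-1, 0) for dk in (-1, 0)
                           if (di, dj, dk) != (0, 0, 0)]
                cube[i*p:(i+1)*p, j*p:(j+1)*p, k*p:(k+1)*p] = generate_patch(borders)
    return cube
```

With `n_per_side=2` this is the \(2^{3}=8\)-step stage, and with `n_per_side=8` the \(8^{3}=512\)-patch stage.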
Quantitative results
Figure
11 shows a few slices from a 3D fake
\(I_{1}\) image generated using the full pipeline, alongside a random real
\(I_{1}\) image. Figure
12 shows the summary statistics of 500 GAN-generated samples, compared to those of real samples. The visual agreement between the real and generated samples is good, although a trained human eye can still distinguish between real and fake samples. In particular, careful visual inspection reveals that the transitions between the different patches are still imperfect and that the generated samples have fewer long-range filaments than the true samples. The summary statistics agree very well for the middle range of mass density, with slight disagreements at the extremes. The shape of the power spectrum density (PSD) matches well, but the overall amplitude is too high for most of the
k range. Naturally, we would expect the error of the multiscale pipeline to be the result of the accumulation of errors from the three upsampling steps. In practice, we observe a lot of similarities between the statistics shown in Fig.
13 (from the last upscaling step) and in Fig.
12 (from the full pipeline). Finally, Table
1 presents the scores obtained for the 30 cubes of size 256
^{3} generated by the full multiscale pipeline. As explained in Sect.
2.4, the reference score gives an indication of the variability within the training set, i.e., how similar two independent sample sets drawn from the training set are. The reference mass histogram score is much higher than the PSD and peak histogram scores, because this statistic has in general much less variance. For that reason, it is probably easier to estimate, as indicated by the score of our pipeline. Cosmological analyses typically require a precision of better than a few percent on these summary statistics, which the GAN method achieves only for specific scales, peak and density values.
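The PSD summary statistic can be sketched as a spherically averaged power spectrum of the density contrast; the normalization and binning below are illustrative assumptions, not necessarily those used for the reported scores:

```python
import numpy as np

def power_spectrum(cube, box_size=1.0):
    """Spherically averaged power spectrum of a 3D density field
    (a sketch; normalization and binning are assumptions)."""
    n = cube.shape[0]
    delta = cube / cube.mean() - 1.0                  # density contrast
    pk3d = (np.abs(np.fft.fftn(delta)) ** 2 / n**3).ravel()
    freqs = np.fft.fftfreq(n, d=box_size / n)
    kx, ky, kz = np.meshgrid(freqs, freqs, freqs, indexing='ij')
    kmag = np.sqrt(kx**2 + ky**2 + kz**2).ravel()
    nbins = n // 2
    edges = np.linspace(0.0, kmag.max() * (1 + 1e-9), nbins + 1)
    idx = np.digitize(kmag, edges) - 1                # bin index 0..nbins-1
    counts = np.bincount(idx, minlength=nbins)
    sums = np.bincount(idx, weights=pk3d, minlength=nbins)
    mask = counts > 0                                 # skip empty shells
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers[mask], sums[mask] / counts[mask]
```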
Score \(S^{*}\) | PSD | Mass hist. | Peak hist.
Multiscale | 2.72 | 5.72 | 0.63
Uniscale | 0.01 | 0.15 | 0.07
Reference | 8.16 | 1433 | 8.17
The scores of the multiscale approach are much higher than those of the simpler single-scale approach described in the following section.
4.3 Comparison to the single scale model
In order to evaluate the effectiveness of the multiscale approach, we compare our model to a uniscale variant that is not conditioned on the downsampled image. Here, each patch is generated using only its 7 neighboring border patches and a latent variable drawn from the prior distribution. More simply, it is a direct implementation of the principle displayed in Fig.
3 left in 3D, where the discriminator is
not conditioned on the downsampled image. An important issue with the uniscale model is that the discriminator never sees more than 64
^{3} pixels at once. Hence, it is likely to fail to capture long-range correlations. In practice, we observed that the training process is unstable and subject to mode collapse. At generation time, the recursive structure of the model often leads to repeating patterns. The resulting samples were of poor quality, as shown in Figs.
14 and
15. This translates to the score presented in Table
1. This experiment demonstrates that conditioning on data at lower scales does play an important role in generating samples of good quality.
5 Conclusion
In this work we introduced a new benchmark for the generation of 3D
N-body simulations using deep generative models. The dataset is made publicly available and contains matter density distributions represented as cubes of
\(256\times 256\times 256\) voxels. While the performance of the generative model can be measured by visual inspection of the generated samples, as commonly done on datasets of natural images, we also offer a more principled alternative based on a number of summary statistics that are commonly used in cosmology. Our investigation into this problem has revealed that several factors make this task challenging, including: (i) the sheer volume of each data sample, which is not straightforwardly tractable using conventional GAN architectures, (ii) the large dynamic range of the data, spanning several orders of magnitude, which requires a custom-designed transformation of the voxel values, and (iii) the high accuracy required for the model to be practically usable in a real cosmological study. Adding to the difficulties of (i) and (ii), this also requires accurately capturing features that are rare in the training set.
As a first baseline result for the newly introduced benchmark, we proposed a new method to train a deep generative model on 3D images. We split the generation process into the generation of smaller patches, each conditioned on its neighboring patches. We also apply a multiscale approach that learns multiple WGANs at different image resolutions, each capturing salient features at a different scale. This approach is inspired by the Laplacian pyramid GAN (Denton et al.
2015) and by PixelCNN (van den Oord et al.
2016), which have both been developed for 2D data.
We find that the proposed baseline produces
N-body cubes with good visual quality compared to the training data, but significant differences can still be perceived. Overall, the summary statistics of real and generated data match well, but notable differences are present for high voxel values in the mass and peak histograms. The power spectrum has the expected shape, with an amplitude that is too high for most of the
k values. The overall level of agreement is promising, but cannot yet be considered sufficient for practical applications in cosmology. Further development will be needed to achieve this goal; in order to encourage it, we have made the dataset and the code publicly available.
In our current model, the discriminator only has access to a partial view of the final image. The dependencies at small scale that may exist between distant patches are therefore not captured by the discriminator. Extending this model to allow the discriminator to have a more global view would be the next logical extension of this work. We have also observed empirically that the extreme right tail of the histogram is often not fully captured by the generator. Designing architectures that would help the generative model to handle large dynamic range in the data could further improve performance. One could also get further inspiration from the literature on generative models for video data, such as Vondrick et al. (
2016), Xiong et al. (
2018), Saito et al. (
2017). Given the observations made in our experiments, one might for instance expect that the two stage approach suggested in Saito et al. (
2017) could address some of the problem seen with the right tail of the distribution.
Another interesting research direction would be to condition the generation process on cosmic time or cosmological parameters. One could for instance rely on a conditional GAN model such as Mirza and Osindero (
2014).
Acknowledgements
We thank Adam Amara for initial discussions on the project idea. We thank Janis Fluri and Raphael Sgier for help with generating the data. We acknowledge the support of the IT service of the Piz Daint computing cluster at the Swiss National Supercomputing Center (CSCS), as well as the Leonhard and Euler clusters at ETH Zurich.
Availability of data and materials
The data and code to reproduce the experiment of this study are available at
https://github.com/nperraud/3DcosmoGAN and
https://zenodo.org/record/1464832.
Competing interests
The authors declare that they have no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit
http://creativecommons.org/licenses/by/4.0/.
Appendix 1: Input data transformation
The general practice in machine learning is to standardize the input data before feeding it to any model. This preprocessing step is important, as many neural networks and optimization blocks are scale-dependent, so most architectures work optimally only when the data is appropriately scaled. Problematically, because of gravity, most of the universe is empty, while most of the matter is concentrated in a few small areas and filaments. The dataset has a minimum value of 0 and a maximum value of
\(185\text{,}874\); most of the voxels are concentrated close to zero, so the distribution is significantly skewed towards smaller values, with an elongated tail towards larger ones. Even with standardization, it is difficult for a generative model to learn very sparse and skewed distributions. Hence, we transform the data using a dedicated function, in a slightly different way than (Rodríguez et al.
2018).
In order to preserve sensitivity to the smaller values, a logarithm-based transformation is a good candidate. Nevertheless, to maintain sensitivity to the large values, we should favor a linear function. To combine the best of the two regimes, we design a function that is logarithmic for small values and linear for large ones, i.e., above a specified cutoff value. The forward transformation
\(y = f(x, c, s)\) is defined as
$$ f(x, c, s) = f^{\prime }(x+s, c) - f^{\prime }(s, c), $$
(3)
where
$$ f^{\prime }(x, c) = \textstyle\begin{cases} 3\log _{e} (x+1) & \text{if }x \leq c, \\ 3(\log _{e} (c+1) + \frac{x}{c} - 1) & \text{otherwise}. \end{cases} $$
(4)
As a result, the backward transformation
\(x = b(y, c, s)\) reads
$$ b(y, c, s) = b^{\prime }\bigl( y + f^{\prime }(s, c), c\bigr) - s, $$
(5)
where
$$ b^{\prime }(y, c) = \textstyle\begin{cases} e^{\frac{y}{3}} - 1 & \text{if }y \leq 3\log _{e}(c+1), \\ c(\frac{y}{3} + 1 - \log _{e}(c+1)) & \text{otherwise}. \end{cases} $$
(6)
Here
c and
s are hyperparameters. For our experiments we found
\(c=20\text{,}000\) and
\(s=3\) to be good choices. After the forward transformation, the distribution of the data is similar to a one-sided Laplacian distribution. We always invert this transformation once new data is generated, before calculating the summary statistics.
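The transformation translates directly into code; a minimal NumPy sketch with the hyperparameters c and s set as above (a forward-then-backward round trip recovers the input exactly):

```python
import numpy as np

def f_prime(x, c):
    # logarithmic below the cutoff c, linear above it
    return np.where(x <= c, 3 * np.log(x + 1),
                    3 * (np.log(c + 1) + x / c - 1))

def forward(x, c=20000, s=3):
    # shift by s so that forward(0) == 0
    return f_prime(x + s, c) - f_prime(s, c)

def b_prime(y, c):
    # exact inverse of f_prime, branch by branch
    return np.where(y <= 3 * np.log(c + 1), np.exp(y / 3) - 1,
                    c * (y / 3 + 1 - np.log(c + 1)))

def backward(y, c=20000, s=3):
    # inverse of forward
    return b_prime(y + f_prime(s, c), c) - s
```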
Appendix 2: NBody training set augmentation
As
N-body simulations are very expensive, so we need to exploit all the available information through dataset augmentation. To augment the training set, the cubes are randomly rotated by multiples of 90 degrees and randomly translated along one of the 3 axes. The cosmological principle states that the spatial distribution of matter in the universe is homogeneous and isotropic when viewed on a large enough scale. As a consequence, there should be no observable irregularities in the large-scale structure over the course of evolution of the matter field that was initially laid down by the Big Bang (Dodelson
2003). Hence, the rotational and translational augmentations do not alter the data distribution that we are trying to model. Moreover, we note that we use circular translation in our augmentation scheme. This is possible because
N-body simulations are created using the
periodic boundary condition: a particle exiting the box on one side enters it immediately on the opposite side. Forces follow the same principle. This prevents the particles from collapsing to the middle of the box under gravity. These augmentations are important given that we only have 30
N-body cubes in our training set.
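A minimal sketch of one such augmentation (the function name `augment` is ours, for illustration); `np.roll` implements the circular translation allowed by the periodic boundary condition:

```python
import numpy as np

def augment(cube, rng):
    """One random augmentation: a rotation by a multiple of 90 degrees
    and a circular translation along a random axis (valid because the
    simulations use periodic boundary conditions)."""
    # random 90-degree rotation in a random plane, applied k times
    axes = rng.choice(3, size=2, replace=False)
    cube = np.rot90(cube, k=rng.integers(4), axes=tuple(axes))
    # circular translation: voxels leaving one face re-enter the opposite one
    axis = rng.integers(3)
    return np.roll(cube, shift=rng.integers(cube.shape[axis]), axis=axis)
```

Both operations merely permute voxels, so the total mass and the voxel distribution of each cube are preserved exactly.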
Appendix 3: Architecture & implementation details
Implementation details
We used Python and TensorFlow to implement the models, which were trained on GPUs with 16 GB of memory. All the GANs are WGANs with a Gaussian prior distribution. Using a single GPU, it takes around 7 seconds to produce one sample of size 256
^{3}, compared to approximately 30 hours for a precise
N-body simulator running on two nodes with 24 cores and a GPU, such as
PkdGrav3 (Potter et al.
2017). In this project, the fast and approximate
L-Picola (Howlett et al.
2015) simulator was used, with approximately 6 hours of runtime on two nodes.
The batch size is set to 8 for all the experiments. All Wasserstein GANs were trained using a gradient penalty loss with
\(\gamma _{ \mathrm{GP}}=10\), as described in Gulrajani et al. (
2017). We use RMSprop with a learning rate of
\(3\cdot 10^{-5}\) for both the generator and the discriminator. The discriminator was updated 5 times per generator update.
Network architectures
The neural networks used in our experiments are variants of deep convolutional networks with inception modules and/or residual connections.
All weights were initialized using the Xavier initializer, except the biases, which were initialized to 0. We used leaky ReLu and spectral normalization (Miyato et al.
2018) to stabilize the network. The architectures are detailed in Tables
2,
3 and
4.
Table 2 Detailed architecture of the low-resolution GAN \(0\rightarrow 32^{3}\). \(d=64\)

Operation | Parameter size | Output shape
Generator | |
Input \(z\sim\mathcal{N}(0,1)\) | – | (n,256)
Dense | (256,256d) | (n,256d)
Reshape | – | (n,4,4,4,4d)
TrConv 3D (Stride 2) | (4,4,4,4d,4d) | (n,8,8,8,4d)
LReLu (α = 0.2) | – | (n,16,16,16,4d)
TrConv 3D (Stride 2) | (4,4,4,4d,2d) | (n,16,16,16,2d)
LReLu (α = 0.2) | – | (n,16,16,16,2d)
TrConv 3D (Stride 2) | (4,4,4,d,d) | (n,32,32,32,d)
LReLu (α = 0.2) | – | (n,32,32,32,2d)
TrConv 3D (Stride 1) | (4,4,4,d,d) | (n,32,32,32,d)
LReLu (α = 0.2) | – | (n,32,32,32,2d)
TrConv 3D (Stride 1) | (4,4,4,d,1) | (n,32,32,32,1)
Discriminator | |
Input generated image | – | (n,32,32,32,1)
Conv 3D (Stride 2) | (4,4,4,1,d) | (n,32,32,32,d)
LReLu (α = 0.2) | – | (n,32,32,32,d)
Conv 3D (Stride 2) | (4,4,4,d,d) | (n,32,32,32,d)
LReLu (α = 0.2) | – | (n,32,32,32,d)
Conv 3D (Stride 1) | (4,4,4,d,2d) | (n,16,16,16,2d)
LReLu (α = 0.2) | – | (n,16,16,16,2d)
Conv 3D (Stride 1) | (4,4,4,2d,4d) | (n,8,8,8,4d)
LReLu (α = 0.2) | – | (n,8,8,8,4d)
Conv 3D (Stride 1) | (4,4,4,4d,8d) | (n,4,4,4,8d)
LReLu (α = 0.2) | – | (n,4,4,4,8d)
Reshape | – | (n,512d)
Dense | (512d,1) | (n,1)
Table 3 Detailed architecture of UpscaleGAN \(32^{3}\rightarrow 64^{3}\). \(d=32\)

Operation | Parameter size | Output shape
Generator | |
Input borders | – | (n,32,32,32,7)
Conv 3D (Stride 2) | (4,4,4,7,d) | (n,16,16,16,d)
Conv 3D (Stride 2) | (4,4,4,d,d) | (n,8,8,8,d)
Conv 3D (Stride 2) | (4,4,4,d,16) | (n,4,4,4,16)
Reshape | – | (n,1024)
Input \(z_{1}\sim\mathcal{N}(0,1)\) | – | (n,1024)
Concatenation | – | (n,2048)
Dense | (2048,256d) | (n,256d)
Reshape | – | (n,16,16,16,2)
Input downsampled corner | – | (n,16,16,16,1)
Input \(z_{2}\sim\mathcal{N}(0,1)\) | – | (n,16,16,16,1)
Concatenation | – | (n,16,16,16,4)
TrConv 3D (Stride 1) | (4,4,4,4,d) | (n,16,16,16,d)
LReLu (α = 0.2) | – | (n,16,16,16,4d)
TrConv 3D (Stride 1) | (4,4,4,d,d) | (n,16,16,16,d)
LReLu (α = 0.2) | – | (n,16,16,16,2d)
TrConv 3D (Stride 2) | (4,4,4,d,4d) | (n,32,32,32,4d)
LReLu (α = 0.2) | – | (n,32,32,32,4d)
TrConv 3D (Stride 1) | (4,4,4,4d,2d) | (n,32,32,32,2d)
LReLu (α = 0.2) | – | (n,32,32,32,d)
TrConv 3D (Stride 1) | (4,4,4,2d,1) | (n,32,32,32,1)
Discriminator | |
Input generated image | – | (n,32,32,32,1)
Input borders | – | (n,32,32,32,7)
Reshape to a cube | – | (n,64,64,64,1)
Input smooth image | – | (n,64,64,64,1)
Concatenation (+ diff) | – | (n,64,64,64,3)
Conv 3D (Stride 1) | (4,4,4,3,d) | (n,64,64,64,d)
LReLu (α = 0.2) | – | (n,64,64,64,d)
Conv 3D (Stride 2) | (4,4,4,d,2d) | (n,32,32,32,2d)
LReLu (α = 0.2) | – | (n,32,32,32,d)
Conv 3D (Stride 2) | (4,4,4,2d,4d) | (n,16,16,16,4d)
LReLu (α = 0.2) | – | (n,16,16,16,d)
Conv 3D (Stride 1) | (4,4,4,d,d) | (n,16,16,16,d)
LReLu (α = 0.2) | – | (n,16,16,16,d)
Conv 3D (Stride 2) | (4,4,4,d,d) | (n,8,8,8,d)
LReLu (α = 0.2) | – | (n,8,8,8,d)
Reshape | – | (n,512d)
Dense | (512d,1) | (n,1)
Table 4 Detailed architecture of UpscaleGAN \(64^{3}\rightarrow 256^{3}\). \(d=64\). The parameter shape of the inception convolution (written InConv) is too large to be written in the table

Operation | Parameter size | Output shape
Generator | |
Input \(z\sim\mathcal{N}(0,1)\) | – | (n,32,32,32,1)
Input smooth image | – | (n,32,32,32,1)
Input borders | – | (n,32,32,32,7)
Concatenation | – | (n,32,32,32,9)
InConv 3D (Stride 1) | * | (n,32,32,32,d)
LReLu (α = 0.2) | – | (n,32,32,32,d)
InConv 3D (Stride 1) | * | (n,32,32,32,d)
LReLu (α = 0.2) | – | (n,32,32,32,d)
InConv 3D (Stride 1) | * | (n,32,32,32,d)
LReLu (α = 0.2) | – | (n,32,32,32,d)
InConv 3D (Stride 1) | * | (n,32,32,32,d)
LReLu (α = 0.2) | – | (n,32,32,32,d)
InConv 3D (Stride 1) | * | (n,32,32,32,d)
LReLu (α = 0.2) | – | (n,32,32,32,d)
InConv 3D (Stride 1) | * | (n,32,32,32,d)
LReLu (α = 0.2) | – | (n,32,32,32,d)
InConv 3D (Stride 1) | * | (n,32,32,32,d)
LReLu (α = 0.2) | – | (n,32,32,32,d)
InConv 3D (Stride 1) | * | (n,32,32,32,1)
ReLu | – | (n,32,32,32,1)
Discriminator | |
Input generated image | – | (n,32,32,32,1)
Input borders | – | (n,32,32,32,7)
Reshape to a cube | – | (n,64,64,64,1)
Input smooth image | – | (n,64,64,64,1)
Concatenation (+ diff) | – | (n,64,64,64,3)
InConv 3D (Stride 2) | * | (n,32,32,32,2d)
LReLu (α = 0.2) | – | (n,32,32,32,2d)
InConv 3D (Stride 1) | * | (n,32,32,32,2d)
LReLu (α = 0.2) | – | (n,32,32,32,2d)
InConv 3D (Stride 1) | * | (n,32,32,32,d)
LReLu (α = 0.2) | – | (n,32,32,32,d)
InConv 3D (Stride 1) | * | (n,32,32,32,d)
LReLu (α = 0.2) | – | (n,32,32,32,d)
InConv 3D (Stride 1) | * | (n,32,32,32,d)
LReLu (α = 0.2) | – | (n,32,32,32,d)
InConv 3D (Stride 1) | * | (n,32,32,32,d)
LReLu (α = 0.2) | – | (n,32,32,32,d)
InConv 3D (Stride 2) | * | (n,16,16,16,d)
LReLu (α = 0.2) | – | (n,16,16,16,d)
InConv 3D (Stride 2) | * | (n,8,8,8,d)
LReLu (α = 0.2) | – | (n,8,8,8,d)
Reshape | – | (n,16384)
Compute PSD | – | (n,1914)
Concatenate | – | (n,18298)
Dense | (18298,64) | (n,64)
LReLu (α = 0.2) | – | (n,64)
Dense | (64,16) | (n,16)
LReLu (α = 0.2) | – | (n,16)
Dense | (16,1) | (n,1)
Handling the input borders is an architectural challenge in itself. We used two different solutions to overcome this issue, one for each scale.
Generator
\(M_{2}\)
The generator
\(M_{2}\) possesses a convolutional encoder for the borders. Once the borders are encoded, we concatenate them with the latent variable. The downsampled image is simply concatenated at the first convolution layer (see Table
3).
Generator
\(M_{1}\)
The generator
\(M_{1}\) does not possess an encoder, but utilizes the borders directly as extra channels. As a result, the generator convolutions all have a stride of 1. The downsampled image is first upsampled using a simple transposed convolution with a constant kernel and then concatenated as an input. The latent variable is of size 32
^{3} to avoid a memory-consuming linear layer. Moreover, as the convolution is shift-invariant, we flip the input borders before feeding them to the generator, so that they are correctly aligned with the produced corner. Furthermore, to improve the capacity of the networks without increasing the number of parameters and channels too much, we use an inception-inspired module, composed of 3 convolutions (
\(1 \times 1\times 1\),
\(2\times 2\times 2\),
\(4 \times 4 \times 4\)) in parallel, followed by a merging
\(1 \times 1 \times 1\) convolution. Finally, to further help the discriminator, we also feed a PSD estimate at the input of its linear layers (see Table
4).
Training stabilization using a regularizer
While it has been shown that the gradient penalty loss of the Wasserstein GAN helps stabilize the training process (Gulrajani et al.
2017), this term does not prevent the discriminator from saturating. For example, when the discriminator has a large final bias, its output will be very large for both real and fake samples, yet its loss might remain controlled, as the output on real samples is subtracted from the one on fake samples. In practice, we noticed that when this behavior occurred, the learning process of the generator was hindered and the produced samples were of worse quality. To circumvent this issue, we added a second regularization term:
$$ \text{ReLu}(D_{\mathrm{real}} \cdot D_{\mathrm{fake}}). $$
(7)
The idea is that the regularization should kick in only to prevent the undesirable effect and should not affect the rest of the training. If the discriminator is doing a good job, then
\(D_{\mathrm{real}}\) should be positive and
\(D_{\mathrm{fake}}\) negative, nullifying the regularization. On the contrary, if both of these terms have the same sign, the output is penalized quadratically, forcing it to remain close to 0. While the effect of this second regularization term is still unclear to us, it did help to stabilize our optimization procedure for the multiscale approach.
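The penalty of Eq. (7) is straightforward to implement; a minimal sketch on scalar batch-averaged discriminator outputs (the helper name `saturation_penalty` is ours, for illustration):

```python
import numpy as np

def saturation_penalty(d_real, d_fake):
    """Eq. (7): ReLu(D_real * D_fake) penalizes the discriminator only
    when its outputs on real and fake batches share the same sign
    (a symptom of saturation); it vanishes when the signs differ."""
    return np.maximum(d_real * d_fake, 0.0)
```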
As we release our code and entire pipeline, we encourage the reader to check it for additional details.
Footnotes
4
At cosmological scales, distances are measured in megaparsecs (Mpc) (
\(1~\text{Mpc} = 3.3 \times 10^{6}~\text{light-years}\)).