The recent advances in deep learning have initiated a new era for generative models. Generative Adversarial Networks (GANs) (Goodfellow et al. 2014) have become a very popular approach, demonstrating the ability to learn complicated representations and produce high-resolution images (Karras et al. 2018). In cosmology, high-resolution simulations of the matter distribution are becoming increasingly important for deepening our understanding of the evolution of structure in the universe (Springel et al. 2005; Potter et al. 2017; Kuhlen et al. 2012). These simulations are made using the N-body technique, which represents the distribution of matter in 3D space by trillions of particles. They are slow to run and computationally expensive, as they evolve the positions of the particles over cosmic time in small time intervals. Generative models have been proposed to emulate this type of data, dramatically accelerating the process of obtaining new simulations once the training is finished (Rodríguez et al. 2018; Mustafa et al. 2019).
N-body simulations represent the matter in a cosmological volume, typically between 0.1 and 10 Gpc, as a set of particles, typically between 100³ and 2000³ in number. The initial 3D positions of the particles are typically drawn from a Gaussian random field with a specific power spectrum. Then, the particles are displaced over time according to the laws of gravity, the properties of dark energy, and other physical effects included in the simulations. During this evolution, the field becomes increasingly non-Gaussian and displays characteristic features, such as halos, filaments, sheets, and voids (Bond et al. 1996; Dietrich et al. 2012).
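To make the initial conditions concrete, the following sketch (our own illustrative code, not taken from any of the cited works; shown in 2D for brevity, with a hypothetical power-law spectrum) draws a Gaussian random field with a specified power spectrum by shaping white noise in Fourier space:

```python
import numpy as np

def gaussian_random_field(n, box_size, power_law=-2.0, rng=None):
    """Gaussian random field on an (n, n) grid with P(k) ~ k**power_law."""
    rng = np.random.default_rng() if rng is None else rng
    # Angular wavenumbers of the Fourier modes
    freqs = 2.0 * np.pi * np.fft.fftfreq(n, d=box_size / n)
    kx, ky = np.meshgrid(freqs, freqs, indexing="ij")
    k = np.sqrt(kx**2 + ky**2)
    # Amplitude sqrt(P(k)); the k = 0 (mean) mode is set to zero
    amplitude = np.zeros_like(k)
    amplitude[k > 0] = k[k > 0] ** (power_law / 2.0)
    # Shape Fourier-space white noise by sqrt(P(k)) and transform back
    noise = np.fft.fftn(rng.standard_normal((n, n)))
    return np.fft.ifftn(noise * amplitude).real
```

Because the amplitude depends only on |k|, the shaped spectrum keeps the Hermitian symmetry of the noise and the inverse transform is real up to numerical error.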
Recently, GANs have been proposed for emulating matter distributions in two dimensions (Rodríguez et al. 2018; Mustafa et al. 2019). These approaches have been successful in generating data of high visual quality, almost indistinguishable from the real simulations even to experts. Moreover, several summary statistics often used in cosmology, such as power spectra and density histograms, also showed good agreement. Some challenges remain when comparing sets of generated samples: in both works, the properties of sets of generated images did not match exactly, and the covariance matrix of the power spectra of the generated maps differed from that of the real maps by the order of 10%.
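As a concrete illustration of one such summary statistic, the following sketch (our own illustrative code, not from the cited works; the binning and normalisation conventions are ours) computes a radially averaged power spectrum of a 2D density map:

```python
import numpy as np

def power_spectrum_2d(density, box_size, n_bins=20):
    """Radially averaged power spectrum of a square 2D density map.

    density: (n, n) array of particle counts or densities.
    box_size: physical side length of the map (e.g. in Mpc).
    """
    n = density.shape[0]
    delta = density / density.mean() - 1.0       # density contrast
    power = np.abs(np.fft.fftn(delta)) ** 2      # unnormalised mode power
    # Angular wavenumber |k| of each Fourier mode
    freqs = 2.0 * np.pi * np.fft.fftfreq(n, d=box_size / n)
    kx, ky = np.meshgrid(freqs, freqs, indexing="ij")
    k_mag = np.sqrt(kx**2 + ky**2)
    # Average the power in radial shells of |k|, skipping the k = 0 mode
    k_bins = np.linspace(k_mag[k_mag > 0].min(), k_mag.max(), n_bins)
    idx = np.digitize(k_mag.ravel(), k_bins)
    pk = np.array([power.ravel()[idx == i].mean() for i in range(1, n_bins)])
    k_centers = 0.5 * (k_bins[1:] + k_bins[:-1])
    return k_centers, pk
```

A density histogram, the other statistic mentioned above, is simply `np.histogram` applied to the (log-scaled) pixel values.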
While these results are encouraging, a significant difficulty remains in scaling these models to generate three-dimensional data, which contain several orders of magnitude more pixels per data instance. We address this problem in this work. We present a publicly available dataset of N-body cubes, consisting of 30 independent instances. Because the dark matter distribution is homogeneous and isotropic, and the simulations use periodic boundary conditions, the data can be easily augmented through shifts, rotations, and flips. The data takes the form of a list of particles with spatial positions x, y, z. It can be pixelised into 3D histogram cubes, in which the matter distribution is represented as density voxels; each voxel contains the count of particles falling into it. If the resolution of the voxel cube is high enough, the particle- and voxel-based representations should be interchangeable for most applications. Approaches generating the matter distribution in the particle-based representation could also be designed; in this work, however, we focus on the voxel-based representation. By publishing the N-body data and the accompanying code we aim to encourage the development of large-scale generative models capable of handling such data volumes.
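To make the voxelisation and the augmentations concrete, a minimal sketch (our own illustrative code, NumPy only; the function names are ours) could look as follows:

```python
import numpy as np

def voxelize(positions, box_size, n_voxels):
    """Bin particle positions of shape (N, 3) into an (n, n, n) count cube."""
    edges = np.linspace(0.0, box_size, n_voxels + 1)
    cube, _ = np.histogramdd(positions, bins=(edges, edges, edges))
    return cube

def random_augmentation(cube, rng):
    """Random periodic shift, axis-aligned rotation, and flips of a 3D cube."""
    # Periodic shift: valid because the simulation box is periodic
    shifts = rng.integers(0, cube.shape[0], size=3)
    cube = np.roll(cube, shifts, axis=(0, 1, 2))
    # Random 90-degree rotation in a randomly chosen plane
    axes = tuple(rng.choice(3, size=2, replace=False))
    cube = np.rot90(cube, k=int(rng.integers(0, 4)), axes=axes)
    # Random flip along each axis
    for axis in range(3):
        if rng.random() < 0.5:
            cube = np.flip(cube, axis=axis)
    return cube
```

All three operations preserve the total particle count and the one-point statistics, so the augmented cubes remain valid draws from the same (statistically homogeneous and isotropic) distribution.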
Our results constitute a baseline solution to the challenge. While the obtained statistical accuracy is currently insufficient for a real cosmological use case, we achieve two goals: (i) we demonstrate that the project is tractable by GAN architectures, and (ii) we provide a framework for evaluating the performance of new algorithms in the future.
Generative models that produce novel, representative samples from high-dimensional data distributions are becoming increasingly popular in various fields, with applications such as image-to-image translation (Zhu et al. 2017) or image in-painting (Iizuka, Simo-Serra and Ishikawa 2017), to name a few. There are many different deep learning approaches to generative models; the most popular are Variational Auto-Encoders (VAEs) (Kingma and Welling 2014), autoregressive models such as PixelCNN (van den Oord et al. 2016), and Generative Adversarial Networks (GANs) (Goodfellow et al. 2014). Regarding prior work on generating 3D images or volumes, two main types of architectures – in particular GANs – have been proposed. The first type (Achlioptas et al. 2018; Fan et al. 2017) generates 3D point clouds with a 1D convolutional architecture by producing a list of 3D point positions. Models of this type do not scale to cases where billions of points are present in a simulation, an important concern given the size of current and future N-body simulations. The second type of approach, including Wu et al. (2016) and Mosser et al. (2017), directly uses 3D convolutions to produce a volume. Although the computation and memory cost is independent of the number of particles, it scales with the number of voxels of the desired volume, which grows cubically with the resolution. While recursive models such as PixelCNN (van den Oord et al. 2016) can scale to some extent, they are slow at generating samples, as they build the output image pixel by pixel in a sequential manner. We take inspiration from PixelCNN and design a patch-by-patch, rather than pixel-by-pixel, approach, which significantly speeds up the generation of new samples.
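The patch-by-patch idea can be sketched as follows (our own illustrative code, not the actual model of this work): by analogy with PixelCNN's pixel ordering, each patch is generated in raster-scan order, conditioned on its already-generated neighbours. Here `generate_patch` is a hypothetical placeholder for a trained conditional generator, as are all other names:

```python
import numpy as np

def generate_cube(generate_patch, n_patches, patch_size, latent_dim, rng):
    """Assemble a cube of n_patches**3 patches in raster-scan order."""
    n = n_patches * patch_size
    cube = np.zeros((n, n, n))

    def block(a, b, c):
        """Return the patch at grid index (a, b, c); zeros outside the cube."""
        if min(a, b, c) < 0:
            return np.zeros((patch_size,) * 3)   # boundary: empty context
        return cube[a*patch_size:(a+1)*patch_size,
                    b*patch_size:(b+1)*patch_size,
                    c*patch_size:(c+1)*patch_size]

    for i in range(n_patches):
        for j in range(n_patches):
            for k in range(n_patches):
                # Condition on the three already-generated neighbours
                context = [block(i-1, j, k), block(i, j-1, k), block(i, j, k-1)]
                z = rng.standard_normal(latent_dim)
                cube[i*patch_size:(i+1)*patch_size,
                     j*patch_size:(j+1)*patch_size,
                     k*patch_size:(k+1)*patch_size] = generate_patch(z, context)
    return cube
```

Generation cost thus grows with the number of patches rather than the number of voxels per network call, and every call produces `patch_size**3` voxels at once instead of one pixel.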
As mentioned above, splitting the generation process into patches reduces the ability of the generator to learn global image features. Some partial solutions to this problem can already be found in the literature, such as the Laplacian pyramid GAN (Denton et al. 2015), which provides a mechanism to learn at different scales for high-quality sample generation, but this approach is not scalable, as the sample image size is still limited. Similar techniques are used for the problem of super-resolution (Ledig et al. 2017; Lai et al. 2017; Wang et al. 2018). Recently, progressive growing of GANs (Karras et al. 2018) has been proposed to improve the quality of the generated samples and stabilize the training of GANs. The size of the samples produced by the generator is progressively increased by adding layers at the end of the generator and at the beginning of the discriminator. In the same direction, Brock et al. (2019) and Lučić et al. (2019) achieved impressive quality in the generation of large images by leveraging better optimization. However, the limitations of the hardware on which the model is trained are reached after a certain increase in size, and all of these approaches will eventually fail to offer the scalability we are after.
GANs were proposed for generating matter distributions in 2D. A generative model for the projected matter distribution, also called a mass map, was introduced by Mustafa et al. (2019). Mass maps are cosmological observables, as they can be reconstructed by techniques such as gravitational lensing (Chang et al. 2018). Mass maps arise through integration of the matter density over the radial dimension with a specific, distance-dependent kernel. The generative model presented in Mustafa et al. (2019) achieved very good agreement with the real data on several important non-Gaussian summary statistics: power spectra, density histograms, and Minkowski functionals (Schmalzing et al. 1996). The distributions of these summaries between sets of generated and real data also agreed well. However, the covariance matrices of the power spectra within the generated and real sets did not match perfectly, differing by the order of 10%.
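The covariance comparison can be made concrete with a short sketch (our own illustrative code; the normalisation choice is ours, not that of the cited works). Given two sets of measured power spectra, one per sample, it returns the element-wise difference of their covariance matrices, normalised by the real-set variances:

```python
import numpy as np

def covariance_difference(pk_real, pk_gen):
    """Normalised difference between power-spectrum covariance matrices.

    pk_real, pk_gen: (n_samples, n_bins) arrays, one power spectrum per row.
    """
    cov_real = np.cov(pk_real, rowvar=False)
    cov_gen = np.cov(pk_gen, rowvar=False)
    # Normalise by the real-set standard deviations rather than by the
    # covariance itself, to avoid dividing by near-zero off-diagonal entries
    norm = np.sqrt(np.outer(np.diag(cov_real), np.diag(cov_real)))
    return (cov_gen - cov_real) / norm
```

A "10% mismatch" in this language means entries of the returned matrix on the order of 0.1.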
A generative model working on 2D slices from N-body simulations was developed by Rodríguez et al. (2018). N-body slices have much more complex features, such as filaments and sheets, as these are not averaged out by projection. Moreover, the dynamic range of pixel values spans several orders of magnitude. The GANs presented by Rodríguez et al. (2018) also achieved good performance, but only for larger cosmological volumes of 500 Mpc. Some mismatch in the power spectrum covariance was also observed.
Alternative approaches to emulating cosmological matter distributions using deep learning have recently been proposed. The Deep Displacement Model (He et al. 2018) uses a U-shaped neural network that learns how to modify the positions of the particles from the initial conditions to a given time in the history of the universe.
Generative models have also been proposed for solving other problems in cosmology, such as the generation of galaxies (Regier et al. 2015), adding baryonic effects to the dark matter distribution (Tröster et al. 2019), recovering certain features from noisy astrophysical images (Schawinski et al. 2017), deblending galaxy superpositions (Reiman and Göhre 2019), and improving the resolution of matter distributions (Kodi Ramanah et al. 2019).