Weak-lensing Mass Reconstruction of Galaxy Clusters with a Convolutional Neural Network


Published 2021 December 28. © 2021. The Author(s). Published by the American Astronomical Society.
Citation: Sungwook E. Hong et al. 2021 ApJ 923 266. DOI: 10.3847/1538-4357/ac3090


Abstract

We introduce a novel method for reconstructing the projected matter distributions of galaxy clusters with weak-lensing (WL) data based on a convolutional neural network (CNN). Training data sets are generated with ray-tracing through cosmological simulations. We control the noise level of the galaxy shear catalog such that it mimics the typical properties of the existing ground-based WL observations of galaxy clusters. We find that the mass reconstruction by our multilayered CNN with the architecture of alternating convolution and trans-convolution filters significantly outperforms the traditional reconstruction methods. The CNN method provides better pixel-to-pixel correlations with the truth, restores more accurate positions of the mass peaks, and more efficiently suppresses artifacts near the field edges. In addition, the CNN mass reconstruction lifts the mass-sheet degeneracy when applied to our projected cluster mass estimation from sufficiently large fields. This implies that this CNN algorithm can be used to measure the cluster masses in a model-independent way for future wide-field WL surveys.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Weak lensing (WL) is now firmly established as the most direct method for measuring the mass of astrophysical objects, ranging from individual galaxies (galaxy–galaxy lensing) to the cosmological large-scale structure (cosmic shear). The many ongoing and planned wide-field WL surveys reflect the elevated level of interest and confidence in this technique (e.g., Laureijs et al. 2011; Spergel et al. 2015; Troxel et al. 2018; Hikage et al. 2019; Ivezić et al. 2019). Without question, the highest priority for maximizing the scientific return from these huge data volumes is to understand and control systematics. A number of WL systematics have been identified, including shear calibration, photometric-redshift degeneracy, model bias, the mass-sheet degeneracy, and astrophysical processes (e.g., Gorenstein et al. 1988; Seitz & Schneider 1996; Squires & Kaiser 1996; High et al. 2007; Jarvis et al. 2008; Mandelbaum et al. 2015; Meyers & Burchat 2015).

In this study, we focus on systematics arising in galaxy cluster mass reconstruction from WL source catalogs. Although galaxy cluster mass reconstruction is one of the earliest WL applications and a demonstration of its power, the main utility of the two-dimensional mass reconstruction has rather been a qualitative investigation of the relative mass distribution of the target field. Very few studies have employed the mass reconstruction for quantitative analysis (e.g., derivation of galaxy cluster masses). This is because the current mass-reconstruction algorithms suffer from various artifacts. For example, the so-called mass-sheet degeneracy (invariance of the observable shear under a certain linear rescaling of the mass) is one of the major obstacles that prevent us from interpreting the result in absolute terms. Severe nonlinearity arising from the transformation of the reduced shear to the convergence is also a crucial contributing factor. Other critical issues include the finite-field effect, the ill-posed mathematical inversion, smoothing artifacts, and field-edge systematics (e.g., Bartelmann 1995; Seitz & Schneider 1996).

The most popular method for estimating the cluster mass so far has been to fit an analytic profile to the observed shear. This assumes that the cluster mass distribution is spherically symmetric and follows a particular halo model favored by numerical simulations, such as a Navarro–Frenk–White (NFW; Navarro et al. 1996) profile. Although this provides a way to overcome the aforementioned drawbacks of the mass reconstruction, the obvious weakness is that individual galaxy clusters do not exactly follow the analytic description, much less satisfy the assumption of spherical symmetry.

In this paper, we introduce a novel method for mass reconstruction based on a convolutional neural network (CNN). CNNs are a branch of deep learning that has emerged as a promising tool in many fields of astronomy in recent years, such as photometric redshift estimation (e.g., Schaefer et al. 2018), strong-lens finding (e.g., Pasquet et al. 2019), image deconvolution (e.g., Flamary 2017), star–galaxy separation (e.g., Kim & Brunner 2017), and morphological classification (e.g., Mittal et al. 2020). The current study is the first endeavor to apply a CNN to the WL mass reconstruction of galaxy clusters. Given the rapid growth in data size and complexity expected from future WL surveys, the approach introduced here will find many useful applications if our CNN algorithm can significantly reduce the systematics of the traditional mass reconstruction described above.

This paper is organized as follows. In Section 2 we describe the basic theory, the CNN architecture, and the training data sets. The performance of our CNN mass reconstruction is presented in Section 3 and discussed in Section 4, before we conclude in Section 5. Throughout the paper, we assume a flat ΛCDM cosmology with H0 = 70 km s⁻¹ Mpc⁻¹, ΩΛ = 0.7, and Ωm = 0.3.

2. Methods

2.1. Basic Weak-lensing Theory

The basic WL theory is briefly reviewed here to make our method description self-contained. We refer readers to other excellent review papers for further details (e.g., Mellier 1999; Bartelmann & Schneider 2001; Hoekstra et al. 2013). The WL formalism is valid in the regime where the source galaxy is much smaller than the characteristic scale of the gravitational potential variation. In this regime, the transformation matrix ${\boldsymbol{A}}$ relating the image-plane position ${\boldsymbol{x}}$ to the source-plane position ${{\boldsymbol{x}}}^{{\prime} }$ via ${{\boldsymbol{x}}}^{{\prime} }={\boldsymbol{A}}{\boldsymbol{x}}$ is described by

Equation (1):

$${\boldsymbol{A}}=(1-\kappa )\left(\begin{array}{cc}1-{g}_{1} & -{g}_{2}\\ -{g}_{2} & 1+{g}_{1}\end{array}\right)$$

where g1(2) is the first (second) component of the reduced shear $g={({g}_{1}^{2}+{g}_{2}^{2})}^{1/2}$, and κ is the convergence. The reduced shear g is related to the shear γ and convergence κ via

Equation (2):

$${\boldsymbol{g}}=\frac{{\boldsymbol{\gamma }}}{1-\kappa }$$

The convergence κ is the unitless surface mass density:

Equation (3):

$$\kappa =\frac{{\rm{\Sigma }}}{{{\rm{\Sigma }}}_{c}}$$

where Σc is the critical surface mass density:

Equation (4):

$${{\rm{\Sigma }}}_{c}=\frac{{c}^{2}{D}_{s}}{4\pi G{D}_{l}{D}_{ls}}$$

In Equation (4), c is the speed of light, G is the gravitational constant, Dl is the angular diameter distance to the cluster (lens), Dls is the angular diameter distance from the lens to the source, and Ds is the angular diameter distance to the source.
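As a concrete illustration, Equation (4) can be evaluated numerically for the lens and source redshifts used later in this paper (zl = 0.5, zs = 1.5) and the adopted flat ΛCDM cosmology. The sketch below is not from the paper; the trapezoidal integration scheme and the unit constants are our own choices.

```python
import numpy as np

# Critical surface mass density Sigma_c (Equation 4) for z_l = 0.5, z_s = 1.5,
# in the paper's flat LCDM cosmology (H0 = 70, Om = 0.3, OL = 0.7).
C_KMS = 299792.458        # speed of light [km/s]
G = 4.30091e-9            # gravitational constant [Mpc (km/s)^2 / Msun]
H0, OM, OL = 70.0, 0.3, 0.7

def comoving_distance(z, n=10000):
    """Flat-LCDM comoving distance in Mpc via trapezoidal integration."""
    zz = np.linspace(0.0, z, n)
    f = 1.0 / np.sqrt(OM * (1 + zz) ** 3 + OL)
    return (C_KMS / H0) * np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(zz))

def angular_diameter_distance(z1, z2=None):
    """D_A between z1 and z2 (or 0 and z1); comoving distances subtract in a flat universe."""
    if z2 is None:
        z1, z2 = 0.0, z1
    return (comoving_distance(z2) - comoving_distance(z1)) / (1 + z2)

def sigma_crit(zl, zs):
    """Equation (4): Sigma_c = c^2 D_s / (4 pi G D_l D_ls), in Msun / Mpc^2."""
    dl = angular_diameter_distance(zl)
    ds = angular_diameter_distance(zs)
    dls = angular_diameter_distance(zl, zs)
    return C_KMS ** 2 * ds / (4 * np.pi * G * dl * dls)

sc = sigma_crit(0.5, 1.5)   # of order 10^15 Msun / Mpc^2 for this configuration
```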

The transformation matrix A in Equation (1) converts a circle into an ellipse. There are multiple ways to define the ellipticity of the resulting ellipse, which has been a source of confusion. If we let its semimajor and -minor axes be a and b, respectively, one can show that the reduced shear g in Equation (1) becomes

Equation (5):

$$g=\frac{a-b}{a+b}$$

Therefore, it is convenient to use Equation (5) to define the ellipticity in WL, which we also adopt in this paper. Since g alone cannot express the orientation of the ellipse, the WL community often uses the complex notation

Equation (6):

$${\boldsymbol{g}}={g}_{1}+{{ig}}_{2}=g\,{e}^{2i\phi }$$

which provides both magnitude $g={({g}_{1}^{2}+{g}_{2}^{2})}^{1/2}$ and orientation $\phi =0.5{\tan }^{-1}({g}_{2}/{g}_{1})$ of the elongation.

Under the assumption that we can assign a unique ellipticity to every galaxy, the same complex notation (Equation (6)) can also be used to express its intrinsic ellipticity e = e1 + i e2 prior to the WL distortion. Then, the transformation of the intrinsic ellipticity e to the lensed (distorted) ellipticity ε by the reduced shear g is given by

Equation (7):

$${\boldsymbol{\epsilon }}=\frac{{\boldsymbol{e}}+{\boldsymbol{g}}}{1+{{\boldsymbol{g}}}^{* }{\boldsymbol{e}}}\quad \mathrm{for}\ | {\boldsymbol{g}}| \leqslant 1$$

and

Equation (8):

$${\boldsymbol{\epsilon }}=\frac{1+{\boldsymbol{g}}{{\boldsymbol{e}}}^{* }}{{{\boldsymbol{e}}}^{* }+{{\boldsymbol{g}}}^{* }}\quad \mathrm{for}\ | {\boldsymbol{g}}| \gt 1$$

where the asterisk denotes the complex conjugate.

Inspection of Equation (7) reveals that, in general, each galaxy's lensed ellipticity ε is only slightly different from its intrinsic ellipticity e in the WL regime, where g is small. When we disregard measurement systematics and assume that the ellipticity distribution of the source population is isotropic, one can show that the unbiased estimator for g is $\left\langle {\boldsymbol{\epsilon }}\right\rangle $.
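The claim that ⟨ε⟩ is an unbiased estimator of g can be checked with a quick Monte Carlo sketch (our own toy check, not from the paper): draw isotropic intrinsic ellipticities, lens them with Equation (7), and average. The shape-noise value σe = 0.24 anticipates Section 2.3.1; the sample size and random seed are arbitrary.

```python
import numpy as np

# Verify <eps> ~ g: lens isotropic intrinsic ellipticities with Equation (7).
rng = np.random.default_rng(42)
g = 0.05 + 0.03j                          # reduced shear, |g| < 1 (WL regime)

n = 400_000
e = rng.normal(0.0, 0.24, n) + 1j * rng.normal(0.0, 0.24, n)
e = e[np.abs(e) < 1.0]                    # keep physical ellipticities only

eps = (e + g) / (1.0 + np.conj(g) * e)    # Equation (7), valid for |g| <= 1

g_hat = np.mean(eps)                      # should be close to g
```

Note that truncating |e| < 1 preserves the isotropy of the intrinsic distribution, so the estimator remains unbiased.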

2.2. Conventional Mass Reconstruction and Its Limitation

The mathematical relation between the shear γ and the convergence κ at the position x is

Equation (9):

$${\boldsymbol{\gamma }}({\boldsymbol{x}})=\frac{1}{\pi }\int D({\boldsymbol{x}}-{{\boldsymbol{x}}}^{{\prime} })\,\kappa ({{\boldsymbol{x}}}^{{\prime} })\,{d}^{2}x^{\prime} $$

where the kernel D is

Equation (10):

$$D({\boldsymbol{x}})=\frac{{x}_{2}^{2}-{x}_{1}^{2}-2{{ix}}_{1}{x}_{2}}{{| {\boldsymbol{x}}| }^{4}}$$

The well-known Kaiser & Squires (1993, hereafter KS93) mass reconstruction is based on the straightforward inversion of Equation (9):

Equation (11):

$$\kappa ({\boldsymbol{x}})=\frac{1}{\pi }\int {\mathfrak{R}}\left[{D}^{* }({\boldsymbol{x}}-{{\boldsymbol{x}}}^{{\prime} })\,{\boldsymbol{\gamma }}({{\boldsymbol{x}}}^{{\prime} })\right]\,{d}^{2}x^{\prime} $$

KS93 evaluate this convolution in Fourier space, while Fischer & Tyson (1997) develop an inversion method in real space. Alternatively, some authors propose to reconstruct the convergence field using Equation (9) through the maximum likelihood method (e.g., Seitz et al. 1998; Bradač et al. 2004; Jee et al. 2007).

Inspection of Equations (9) and (11) shows that several artifacts may be introduced by the KS93 mass reconstruction. First, the evaluation of the convolution suffers from the so-called finite-inversion problem because, in principle, κ requires the information of the shear γ over an infinite area. Second, Equation (11) uses γ for its input, whereas the directly attainable information from averaging over many galaxy shapes is only g. Third, the solution is not unique because the same equation holds under the transformation κ → λκ + (1 − λ), where λ is arbitrary. This ambiguity is often termed the "mass-sheet degeneracy" because, although mathematically somewhat misleading, one can view the transformation as an addition of a thin sheet of mass when λ ≈ 1. Fourth, in the central region of massive clusters where the WL assumption no longer holds, the above equations lose their validity. Fifth, to suppress noise amplification in the inversion, we can only obtain a smoothed convergence field, which gives a biased mass estimate even in the ideal situation where all other issues are carefully accounted for.
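A minimal numerical sketch of the KS93 inversion (our own toy implementation, not the paper's code) makes the degeneracy concrete: on a periodic FFT grid, the k = 0 mode of the kernel is undefined, so the mean of κ is unrecoverable, which is the mass-sheet degeneracy in its simplest (additive, λ ≈ 1) form. The grid size and the Gaussian "cluster" are arbitrary.

```python
import numpy as np

# Toy KS93: build a convergence field, compute the shear in Fourier space
# (Equation 9), then invert back (Equation 11).  The periodic grid side-steps
# the finite-field problem; the k = 0 mode carries the mean of kappa and is
# lost, illustrating the mass-sheet degeneracy.
N = 128
y, x = np.mgrid[0:N, 0:N]
kappa = 0.3 * np.exp(-(((x - N / 2) ** 2 + (y - N / 2) ** 2) / (2 * 8.0 ** 2)))

k1 = np.fft.fftfreq(N)[None, :]
k2 = np.fft.fftfreq(N)[:, None]
ksq = k1 ** 2 + k2 ** 2
ksq[0, 0] = 1.0                       # avoid division by zero at k = 0

D_hat = ((k1 ** 2 - k2 ** 2) + 2j * k1 * k2) / ksq
D_hat[0, 0] = 0.0                     # k = 0 mode: the mean is unconstrained

gamma = np.fft.ifft2(D_hat * np.fft.fft2(kappa))               # kappa -> gamma
kappa_rec = np.fft.ifft2(np.conj(D_hat) * np.fft.fft2(gamma)).real

# kappa is recovered only up to its mean:
residual = np.max(np.abs((kappa_rec + kappa.mean()) - kappa))
```

Because |D̂(k)|² = 1 for every k ≠ 0, the round trip is exact except for the lost mean, so `residual` is at the level of floating-point noise.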

2.3. Mass Reconstruction with a Convolutional Neural Network

2.3.1. Generation of the Training Data Set

Generation of our training data set starts from the convergence (κ) maps created from cosmological simulations via ray tracing. We use the publicly available data of MassiveNuS (Liu et al. 2018). The simulation was originally designed to investigate the impact of massive neutrinos on the large-scale structure. For the current investigation, we chose to retrieve the data set corresponding to the ∑mν = 0.177 eV, Ωm = 0.2485, and As = 2.0644 × 10⁻⁹ setting. However, we emphasize that the details of the simulation parameters are not important within the scope of the current study because, as we demonstrate below, our CNN algorithm is designed to learn the rule that maps the reduced shear field to the convergence field according to general relativity and thus is independent of the above cosmological parameters.

The original data set consists of a total of 50,000 convergence images at five different source redshifts (z = 0.5, 1.0, 1.5, 2.0, and 2.5; 10,000 convergence fields per source redshift), each simulating an area of 3.5° × 3.5° with a pixel resolution of 0.4′, which matches the field of view (FOV) of the Vera C. Rubin Observatory (Ivezić et al. 2019). Because of this large pixel size, the maximum convergence value never exceeds unity. For the same field, a higher source redshift convergence image is richer in substructure because more line-of-sight structure is included and the lensing efficiency becomes higher. We use the data set created for a source redshift of 1.5. We verified that training with all five source redshift data sets does not improve the result.

We identified clusters by running SExtractor (Bertin & Arnouts 1996) on the convergence field image and cropped a 32′ × 32′ region approximately centered on each cluster. We randomly generated positions of 25,000 sources within this subfield. The distribution matches the typical source density of ∼25 per sq. arcmin in our previous Subaru WL studies (e.g., Finner et al. 2017; Kim et al. 2019; Yoon et al. 2020). Shears γ at the position x were computed using Equation (9). This shear γ is then converted into the reduced shear g through g = γ/(1 − κ). Here we do not consider dispersions in lens and source redshift and assume that all lensing masses and sources are confined to zl = 0.5 and zs = 1.5, respectively. Furthermore, as explained above, no convergence pixel exceeds unity, and thus only Equation (7) is needed for the ellipticity transformation.

We set the intrinsic shape noise per component to σe = 0.24, which is approximately the empirical value from the Hubble Space Telescope (HST) image analysis. In addition to this shape noise, there is a measurement error due to pixel noise. Assuming that the measurement error is independent of the shape noise, we produced the total ellipticity error as a sum of two Gaussian random numbers. The measurement error depends on galaxy properties (e.g., magnitude, size, and profile shape) and signal-to-noise ratios (S/N). We adopted the source magnitude distribution of the Kim et al. (2019) study, which starts at ∼21.5th mag, peaks at ∼25.5th mag, and truncates at ∼27.5th mag in V band. The observed relation between magnitude and ellipticity measurement error in Kim et al. (2019) is employed to generate the ellipticity measurement error for our sources. We select 7000 convergence fields and divide them into 5000 training, 1800 validation, and 200 test samples.
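The noise model above can be sketched as follows. The toy magnitude distribution and the error-vs-magnitude relation below are crude stand-ins for the empirical Kim et al. (2019) curves (which we do not reproduce), used only to show that the two independent Gaussian contributions add in quadrature.

```python
import numpy as np

# Per-component ellipticity noise: shape noise (sigma_e = 0.24) plus an
# independent, magnitude-dependent measurement error, drawn as two Gaussians.
rng = np.random.default_rng(1)
SIGMA_E = 0.24

n = 100_000
mag = np.clip(rng.normal(25.5, 1.5, n), 21.5, 27.5)   # toy magnitude distribution
sigma_m = 0.05 + 0.02 * (mag - 21.5)                  # toy error-vs-magnitude relation

noise = rng.normal(0.0, SIGMA_E, n) + rng.normal(0.0, 1.0, n) * sigma_m

# For independent Gaussians the variances add:
expected_std = np.sqrt(SIGMA_E ** 2 + np.mean(sigma_m ** 2))
```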

2.3.2. Architecture of the CNN

Figure 1 and Table 1 summarize the architecture of the CNN model that we use to predict the convergence map from the WL shear data sets. Our CNN model takes two-dimensional (2D) arrays of ε1(x), ε2(x), and Δε(x) as a three-channel input, where x denotes the 2D pixel coordinates, and εi(x) (i = 1, 2) and Δε(x) are the i-th average ellipticity (reduced shear) component and its error at the position x, respectively. Because we randomly positioned source galaxies (Section 2.3.1), these (regularly spaced) input grids were constructed by weight-averaging the (irregularly spaced) source galaxy ellipticities with a FWHM = 7″ Gaussian kernel; we used the distance between the center of each grid cell and the source position for the kernel evaluation. For each cluster, the full area of the initial field is 32′ × 32′, which is represented by 2D arrays of 500 × 500. We performed data augmentation by subsampling 24.7′ × 24.7′ (386 × 386) regions 1444 times. In addition, we applied four rotations (0°, 90°, 180°, and 270°) and two axis flips. The total number of resulting subfields for each cluster is 1444 × (4 + 2) = 8664. The subsampling scheme also prevents the CNN from learning that the position of the cluster is always at the field center.
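The gridding step can be sketched as below with our own toy field (not the paper's 386 × 386 grid); the field size, source count, and unit conventions are illustrative assumptions, while the kernel-weighted averaging over grid-center-to-source distances follows the description above.

```python
import numpy as np

# Weight-average irregularly placed ellipticities onto a regular grid with a
# Gaussian kernel of the grid-center-to-source distance (FWHM = 7 arcsec).
rng = np.random.default_rng(7)
FWHM = 7.0 / 60.0                               # 7 arcsec in arcmin
SIGMA = FWHM / (2.0 * np.sqrt(2.0 * np.log(2.0)))

field = 2.0                                     # toy 2' x 2' field
nsrc, ngrid = 200, 16
xs = rng.uniform(0, field, nsrc)
ys = rng.uniform(0, field, nsrc)
eps = rng.normal(0.0, 0.24, nsrc)               # one ellipticity component

centers = (np.arange(ngrid) + 0.5) * field / ngrid
gx, gy = np.meshgrid(centers, centers)

# distances from every grid center to every source: shape (ngrid, ngrid, nsrc)
d2 = (gx[..., None] - xs) ** 2 + (gy[..., None] - ys) ** 2
w = np.exp(-0.5 * d2 / SIGMA ** 2)
eps_grid = (w * eps).sum(axis=-1) / w.sum(axis=-1)
```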

Figure 1.

Figure 1. Schematic diagram showing the architecture of our convolutional neural network. The input channel consists of three (ε1, ε2, and Δε) layers of 2D (386 × 386) arrays. The main part of the CNN architecture is the repeated combination of convolution and transposed convolution with 49 × 49 filters. We use skip-connections between the output of each convolution layer and that of the previous layer that has the matching size. See Table 1 and the text for details.


Table 1. Outline of Our Convolutional Neural Network

Layer            Filter Size    Multiplied to    Output Size
Input            ...            ...              (3, 386, 386)
Conv2D-1         (3, 3)         ...              (8, 384, 384)
AvgPool          (3, 3)         ...              (8, 128, 128)
TransConv2D-1    (49, 49)       ...              (8, 176, 176)
Conv2D-2         (49, 49)       ...              (8, 128, 128)
Multiply-1       ...            AvgPool          (8, 128, 128)
TransConv2D-2    (49, 49)       ...              (8, 176, 176)
Conv2D-3         (49, 49)       ...              (8, 128, 128)
Multiply-2       ...            Multiply-1       (8, 128, 128)
TransConv2D-3    (49, 49)       ...              (8, 176, 176)
Conv2D-4         (49, 49)       ...              (8, 128, 128)
Multiply-3       ...            Multiply-2       (8, 128, 128)
TransConv2D-4    (49, 49)       ...              (8, 176, 176)
Conv2D-5         (49, 49)       ...              (8, 128, 128)
Multiply-4       ...            Multiply-1       (8, 128, 128)
TransConv2D-5    (49, 49)       ...              (16, 176, 176)
Output           (49, 49)       ...              (1, 128, 128)

Note. See text and Figure 1 for a description of each layer.


The main part of our CNN architecture includes the repeated combinations of 2D convolution (Conv2D-# in Table 1) and transposed-convolution layers (TransConv2D-#) with the identical filter size. We tested various choices of filter sizes and found that the 49 × 49 filter gives the best overall performance. Readers are referred to the Appendix for performance comparisons among different CNN architectures with various choices of filter sizes, input layers, and loss functions. The Conv2D-# and TransConv2D-# operations are activated by the hyperbolic tangent function (tanh). Inspired by the residual neural network (ResNet; He et al. 2016), we use skip-connections between the output of each Conv2D-# layer and that of the previous layer with the matching output size by applying a multiplication operation (Multiply-#). We apply batch normalization to the output of each Multiply-# layer to avoid the so-called gradient-vanishing problem (Ioffe & Szegedy 2015). These repeated operations of Conv2D-# and TransConv2D-# are designed to extract features while preserving the size of the output layer (without introducing any padding). Moreover, this architecture outperforms the other architectures that have only convolution layers when it comes to the prediction of the mass peak positions (see the Appendix). Although we did not use any arbitrary padding, the convergence estimates near the field boundary can easily be influenced by the nonvanishing filter size. Therefore, the values within the 14 boundary pixels were not used during our training. The pixel scale of the final κ map is 0.192′ pixel⁻¹.
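The layer sizes in Table 1 follow from the standard size rules for unpadded, stride-1 convolutions and their transposed counterparts. The bookkeeping below is our reading of the table, not code from the paper.

```python
# Size rules for "valid" (no-padding) layers, stride 1 unless noted:
def conv_out(n, k):
    """Output side of an n x n input after a k x k valid convolution."""
    return n - k + 1

def transconv_out(n, k):
    """Output side of an n x n input after a k x k transposed convolution."""
    return n + k - 1

def pool_out(n, k):
    """Output side after k x k average pooling with stride k."""
    return n // k

# Conv2D-1: 386 -> 384 with a 3 x 3 filter; AvgPool: 384 -> 128; and each
# TransConv2D/Conv2D pair with 49 x 49 filters maps 128 -> 176 -> 128,
# preserving the working size without any padding.
sizes = (conv_out(386, 3), pool_out(384, 3),
         transconv_out(128, 49), conv_out(176, 49))
```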

We performed our CNN training with TensorFlow (Abadi et al. 2015). During the training, we used the Adam optimizer (Kingma & Ba 2014) with a learning rate of 10⁻⁵, 50 mini-batches per step, and 20 steps per epoch. In this study, we introduce the following weighted mean-square error (MSE), inspired by the focal loss (Lin et al. 2017):

Equation (12):

$${\mathcal{L}}=\left\langle {\omega }_{f}({\boldsymbol{x}}){\left[{\kappa }_{\mathrm{pred}}({\boldsymbol{x}})-{\kappa }_{\mathrm{truth}}({\boldsymbol{x}})\right]}^{2}\right\rangle $$

where the weight ωf(x) at each pixel is determined by the value of the truth convergence:

Equation (13)

This loss function is chosen so that our CNN model is mostly constrained by the high-density regions of clusters, where our scientific interests lie (see the Appendix for performance comparisons). We ran 200 epochs with an NVIDIA V100 GPU, which takes about one hour per training run. As a convergence test of our CNN training, we executed 10 independent runs with the same CNN architecture. For each run, we adopted the model that minimized the validation loss. In our presentation of the results (Section 3), the standard deviations from the 10 runs are used as error estimates.
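A sketch of such a weighted MSE is given below. Since the exact weight function of Equation (13) is not reproduced here, the power-law form and exponent used are purely illustrative assumptions, chosen only to up-weight high-κ (cluster) pixels as the text describes.

```python
import numpy as np

# Focal-loss-style weighted MSE (cf. Equation 12).  The weight below is an
# ASSUMED stand-in for Equation (13): larger where the truth convergence is
# larger, so cluster cores dominate the loss.
def weighted_mse(kappa_pred, kappa_truth, alpha=2.0):
    w = (1.0 + np.abs(kappa_truth)) ** alpha      # assumed weight function
    return np.mean(w * (kappa_pred - kappa_truth) ** 2)

rng = np.random.default_rng(0)
truth = np.abs(rng.normal(0.05, 0.05, (128, 128)))   # toy convergence field
pred = truth + rng.normal(0.0, 0.01, truth.shape)    # toy prediction
loss = weighted_mse(pred, truth)
```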

3. Results

In this section, we compare our CNN results with those of KS93 for the test sample comprising 200 cluster fields. Because of the data augmentation procedure (Section 2.3), multiple (subsampled) mass maps are produced for each cluster. Thus, we created one mosaic convergence image for each cluster by taking the average of the multiple mass maps. The cluster is approximately located at the center of this mosaic, and we use the mosaic for comparison with the truth and KS93 results. We recall that because these multiple mass maps are generated from the same source catalog, this mosaicking procedure does not benefit from reduced statistical noise. In Section 3.1 we use visual inspection to qualitatively compare the reconstruction results. In Section 3.2 we contrast the values of the reconstructed convergence with those of the truth by pixel-by-pixel comparison and by evaluating their probability distributions. In Sections 3.3 and 3.4 we investigate the reconstructed cluster masses and the positional accuracy of their density peaks, respectively. Finally, we examine the performance of our CNN method in the presence of bright stars in Section 3.5. Table 2 summarizes our comparison between the KS93 and CNN results.

Table 2. Summary of the Performances of Our CNN and the KS93 Mass Reconstructions for the 200 Test Data Sets

Method     D(κ̃_pred, κ̃_truth)    M^cl_pred / M^cl_truth    Δpeak
KS93       6.26 ± 4.57             0.484 (−0.149, +0.218)    4.27′ (−3.94′, +8.63′)
CNN        4.36 ± 3.77             0.867 (−0.296, +0.327)    0.60′ (−0.38′, +4.92′)
CNN-BS     3.66 ± 3.22             0.554 (−0.211, +0.257)    1.53′ (−1.14′, +4.21′)

Note. See text for the definition of the D metric.


3.1. Qualitative Comparison Based on Visual Inspection

Figure 2 displays an example of our CNN mass reconstruction. The comparison with the truth and KS93 results illustrates that our CNN mass reconstruction is superior to that of KS93 in terms of (1) the recovery of the true κ range, (2) the representation of the large-scale structure around the cluster, and (3) the suppression of the noise in the cluster outskirts. Although here only one case is illustrated, these advantages are present for the rest of the test sample.

  • Recovery of the κ range. The truth map shows that κ ranges from ∼0.05 to ∼0.25, where the maximum value is found at the cluster center. The KS93 reconstruction fails to recover this convergence range at the high end. The convergence value at the cluster center is only ∼0.05, while the global maximum is found at a different location (see the location of the X). On the other hand, the CNN mass reconstruction gives a much higher value κ ∼ 0.15 at the cluster center. Given the inevitable smoothing effect arising from the sparse sampling (25 sources per sq. arcmin), we believe that the improvement over the KS93 is remarkable. We discuss this issue more quantitatively in Section 3.2.
  • Reconstruction of the large-scale structure. Inspection of the truth map (Figure 2) indicates that the cluster is not isolated, but is located in the high-density environment. While it is difficult to trace this large-scale structure surrounding the cluster in the KS93 result, the feature, albeit somewhat smoothed, clearly stands out as an overdense region in our CNN mass reconstruction.
  • Suppression of noise. Since the S/N value depends on the local strength of WL signal given the same number density of sources, an ideal mass-reconstruction method should employ a smoothing scheme where the kernel size matches the local S/N value. However, in general, it is nontrivial to implement such an "adaptive smoothing" scheme in practice because the S/N information is only obtained after a high-quality convergence field is reconstructed. Therefore, a common practice in the WL community is to perform mass reconstruction with a fixed-size smoothing kernel often optimized for the central region of the cluster (e.g., van Waerbeke 2000). The inevitable artifact is the production of many spurious mass peaks in the cluster outskirt where the S/N value is low. The comparison between the KS93 and our CNN results shows that our CNN result nicely suppresses the noise fluctuation in the outskirt region while still detecting substructures if they are significant (see the bottom panel of Figure 2).

Figure 2.

Figure 2. Example of our CNN mass reconstruction. We show the truth convergence map (left) and the reconstructions with the KS93 (middle) and our CNN methods (right). The top panel displays the convergence values κ(x) as they are, whereas the bottom panel shows the rescaled versions using the transformation $\widetilde{\kappa }({\boldsymbol{x}})\equiv (\kappa ({\boldsymbol{x}})-\langle \kappa \rangle )/{\rm{\Delta }}\kappa $, where 〈κ〉 and Δκ are the average and standard deviation, respectively, evaluated within the field. The X denotes the location of the highest value within each convergence field. Here we only display the central 26.6′ × 26.6′ region. Visual inspection shows that our CNN reconstruction significantly outperforms the KS93 method in terms of the dynamic-range restoration, noise suppression, and large-scale structure representation.


3.2. Convergence Distribution

The literature has shown that the distribution of the convergence field can be well approximated by a log-normal distribution characterized by an extended high-end tail (e.g., Jain et al. 2000; Hilbert et al. 2011; Clerkin et al. 2017). Figure 3 compares the convergence distributions between the KS93 and our CNN reconstructions for the entire test sample and shows that the CNN distribution follows the log-normal trend of the truth (see the left panel). On the other hand, the convergence distribution in the KS93 result is symmetric around zero without any sign of an extended tail at the high end.

Figure 3.

Figure 3. Comparison of convergence (κ) distributions. We measure the κ distribution from the entire test sample (left). The right panel is the same, except that the distribution is obtained for $\widetilde{\kappa }$. The orange shade represents the standard deviation measured from our 10 independent runs. The CNN reconstruction provides an extended tail at the high end, mimicking the feature in the truth, whereas the KS93 distribution is nearly symmetric around zero. When the convergence is rescaled with its standard deviation, the agreement improves, as can be seen in the right panel.


The comparison between the CNN result and the truth shows that the CNN distribution is somewhat narrower. This happens because the CNN mass map based on a finite number of source galaxies (25 galaxies per sq. arcmin) is inevitably smoother than the truth. In order to compensate for this smoothing effect, we propose the following normalization:

Equation (14):

$$\widetilde{\kappa }({\boldsymbol{x}})\equiv \frac{\kappa ({\boldsymbol{x}})-\langle \kappa \rangle }{{\rm{\Delta }}\kappa }$$

where 〈κ〉 and Δκ are the average and standard deviation, respectively. The rescaling of the convergence through this normalization takes the reduction of Δκ in mass reconstruction into account. However, as mentioned in Section 3.1, the smoothing kernel is not uniform in the CNN mass reconstruction; effectively, the cluster outskirts are smoothed with larger kernels than the cores. Therefore, the proposed normalization (Equation (14)) does not completely resolve the issue. Nevertheless, the right panel of Figure 3 shows that the agreement between the CNN and truth distributions improves dramatically after the normalization.

The joint distribution shown in Figure 4 also confirms that the CNN mass reconstruction provides significantly better pixel-to-pixel correlations with the truth. Furthermore, similarly to the previous case, the normalization significantly strengthens the correlation with the truth (Figure 5).

Figure 4.

Figure 4. Joint distribution of κ between reconstructed and truth convergence fields. Contours show the 68%, 95%, and 99.7% confidence levels. Filled orange circles are the medians of κtruth in the equal-width bins, and their error bars represent the 68% certainties in κtruth and κpred. The CNN reconstruction shows an improved correlation with the truth, although it is also clear that the κ values are underestimated.

Figure 5.

Figure 5. Similar to Figure 4, except that the plots are drawn with the normalized convergence ($\widetilde{\kappa }$).


To quantify the similarity of the reconstruction to the truth, one can suggest the absolute deviation $| {\widetilde{\kappa }}_{\mathrm{pred}}({\boldsymbol{x}})-{\widetilde{\kappa }}_{\mathrm{truth}}({\boldsymbol{x}})| $ as a potential metric. However, this metric, if used as is, would be dominated by the statistics of the convergence pixels near zero. Therefore, we introduce the weighted version as follows:

Equation (15):

$${ \mathcal D }\left({\widetilde{\kappa }}_{\mathrm{pred}},{\widetilde{\kappa }}_{\mathrm{truth}}\right)=\left\langle {\widetilde{\omega }}_{p}({\boldsymbol{x}})\left|{\widetilde{\kappa }}_{\mathrm{pred}}({\boldsymbol{x}})-{\widetilde{\kappa }}_{\mathrm{truth}}({\boldsymbol{x}})\right|\right\rangle $$

where the weight ${\widetilde{\omega }}_{p}({\boldsymbol{x}})$ is inversely proportional to the probability distribution:

Equation (16):

$${\widetilde{\omega }}_{p}({\boldsymbol{x}})\propto {\left[p\left({\widetilde{\kappa }}_{\mathrm{truth}}({\boldsymbol{x}})\right)\right]}^{-1}$$

In practice, the weight can diverge where the probability density approaches zero. To prevent this, we use a discrete histogram of the truth with 50 bins for the estimation of ${\widetilde{\omega }}_{p}$. The ${ \mathcal D }$ metric (better if closer to zero) from the CNN mass reconstruction is 4.36 ± 3.77, whereas the KS93 result gives ${ \mathcal D }=6.26\pm 4.57$. This metric indicates that the κ statistics from the CNN reconstruction better match those of the truth.
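The metric can be sketched as follows with our own toy implementation; the histogram-based inverse-probability weight and its normalization are our reading of Equations (15) and (16), and the toy fields are illustrative.

```python
import numpy as np

# D metric sketch: pixelwise absolute deviation, weighted inversely by a
# 50-bin histogram estimate of the truth distribution so that the rare
# high-kappa pixels are not swamped by the many pixels near zero.
def d_metric(pred, truth, nbins=50):
    hist, edges = np.histogram(truth, bins=nbins, density=True)
    idx = np.clip(np.digitize(truth, edges) - 1, 0, nbins - 1)
    p = np.maximum(hist[idx], 1e-12)       # guard against empty bins
    w = 1.0 / p
    w /= w.mean()                          # normalize weights (our assumption)
    return np.mean(w * np.abs(pred - truth))

rng = np.random.default_rng(3)
truth = rng.lognormal(mean=-3.0, sigma=0.5, size=(128, 128))  # toy kappa field
pred = truth + rng.normal(0.0, 0.02, truth.shape)             # toy reconstruction
score = d_metric(pred, truth)
```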

3.3. Projected Cluster Mass

The pixel-to-pixel comparison in Section 3.2 shows that although our CNN mass reconstruction better recovers the convergence statistics of the truth than the KS93 method, the distribution is somewhat narrower because of the smoothing that is implicitly applied to the reconstructed convergence field by the CNN. It is our premise that this smoothing artifact is of lesser concern when the interest is in estimating the integrated convergence within a reasonably large aperture. We define the projected cluster mass ${M}_{\mathrm{truth}}^{\mathrm{cl}}$ to be the sum of the convergence values within an aperture of radius r = 1.92′ (10 convergence pixels). At the cluster redshift of 0.5, this radius corresponds to 0.72 Mpc in the adopted cosmology. A projected cluster mass of 10 (∑κ = 10) corresponds to ∼7 × 10¹³ M⊙.
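The aperture sum can be sketched as below on a toy convergence map; the map, its pixel scale, and the Gaussian profile are illustrative assumptions, while the 10-pixel circular aperture follows the definition above.

```python
import numpy as np

# Projected cluster mass: sum of kappa within a circular aperture of radius
# 10 pixels (1.92' in the paper's final maps) around a chosen center.
def projected_mass(kappa, center, radius=10):
    y, x = np.indices(kappa.shape)
    mask = (x - center[1]) ** 2 + (y - center[0]) ** 2 <= radius ** 2
    return kappa[mask].sum()

# Toy cluster: Gaussian convergence blob with peak 0.2 and sigma = 5 pixels.
N = 64
y, x = np.mgrid[0:N, 0:N]
kappa = 0.2 * np.exp(-((x - 32) ** 2 + (y - 32) ** 2) / (2 * 5.0 ** 2))

m_cl = projected_mass(kappa, (32, 32))   # aperture captures most of the blob
```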

Figure 6 shows the comparison of ${M}_{\mathrm{truth}}^{\mathrm{cl}}$ between the reconstructed and the truth values. As seen in the pixel-to-pixel comparison, the CNN mass reconstruction also outperforms the KS93 result in the cluster mass estimation. In addition, it is remarkable that the agreement with the truth is significantly better than that in the convergence pixel-to-pixel comparison. The slope ${M}_{\mathrm{CNN}}^{\mathrm{cl}}/{M}_{\mathrm{truth}}^{\mathrm{cl}}={0.867}_{-0.296}^{+0.327}$ is consistent with unity. On the other hand, we obtain ${M}_{\mathrm{KS}}^{\mathrm{cl}}/{M}_{\mathrm{truth}}^{\mathrm{cl}}={0.484}_{-0.149}^{+0.218}$ for the KS93 reconstruction, which is a ≳2σ departure from unity. For the case of the CNN, the data points at ${M}_{\mathrm{truth}}^{\mathrm{cl}}\gtrsim 40$ hint at the possibility that the estimated masses may be systematically lower. Although the sample size is small in this regime, we speculate that this may happen because the employed aperture radius (r = 1.92′) is not sufficiently large for these very massive clusters.

Figure 6.

Figure 6. Comparison of projected cluster masses between prediction and truth. The projected cluster mass Mcl is defined to be the sum of convergence values within an r = 1.92′ (∼0.72 Mpc) radius circular aperture centered on the truth mass peak. Filled orange circles are the median values of ${M}_{\mathrm{truth}}^{\mathrm{cl}}$ in the equal-width bins, and their error bars represent the 68% certainties of ${M}_{\mathrm{truth}}^{\mathrm{cl}}$ and ${M}_{\mathrm{pred}}^{\mathrm{cl}}$ within the bins. The errors on individual data points (filled black circles) in the right panel are the standard deviations of the CNN results from 10 independent runs. The dashed blue lines and filled area are the median and 68% confidence levels of the ratio ${M}_{\mathrm{pred}}^{\mathrm{cl}}/{M}_{\mathrm{truth}}^{\mathrm{cl}}$. The projected mass of Mcl = 10 approximately corresponds to M200 ∼ 7 × 10¹³ M⊙. The ratio of the CNN masses to the truth is consistent with unity (${0.867}_{-0.296}^{+0.327}$), while the ratio is significantly lower (${0.484}_{-0.149}^{+0.218}$) when the KS93 masses are used.


3.4. Cluster Centroid

Robust estimation of centroids is an important issue in cluster WL studies (e.g., von der Linden et al. 2014; Randall et al. 2008). The centroid serves as a reference to characterize the properties of the cluster. Moreover, in merging galaxy clusters, the position of the mass clump with respect to other cluster components is critical in our reconstruction of their merging scenarios. Here we compare the performance in centroid recovery between the KS93 and our CNN methods.

We measured the centroid in two steps. First, we located the pixel with the highest convergence value. Then, we applied a 21 pixel × 21 pixel ($4\buildrel{\,\prime}\over{.} 03\times 4\buildrel{\,\prime}\over{.} 03$) square top-hat window centered on that pixel and evaluated the first moments. Occasionally, negative convergence values are present within the window in the KS93 mass reconstruction. To prevent the centroid from leaving the window in this case, we rescaled the mass map such that the minimum value within the window becomes zero. We apply the top-hat window in order to include the contribution from the large-scale structure around the peak in our centroid estimate.
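The two-step measurement above can be sketched as follows (function name ours; half=10 gives the 21 × 21 pixel window):

```python
import numpy as np

def measure_centroid(kappa, half=10):
    """Two-step centroid: locate the peak pixel, then take first moments
    within a (2*half+1) x (2*half+1) top-hat window around it."""
    ny, nx = kappa.shape
    py, px = np.unravel_index(np.argmax(kappa), kappa.shape)
    y0, y1 = max(py - half, 0), min(py + half + 1, ny)
    x0, x1 = max(px - half, 0), min(px + half + 1, nx)
    # Rescale so the window minimum is zero: KS93 maps can contain negative
    # convergence values, which would otherwise pull the centroid outside.
    win = kappa[y0:y1, x0:x1] - kappa[y0:y1, x0:x1].min()
    y, x = np.mgrid[y0:y1, x0:x1]
    norm = win.sum()
    return (y * win).sum() / norm, (x * win).sum() / norm
```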

Figure 7 displays the deviations of the reconstructed mass centroids with respect to the truth. The CNN and KS93 results give similarly small (1 ∼ 3 pixels) centroid deviations for massive clusters (${M}_{\mathrm{truth}}^{\mathrm{cl}}\gtrsim 35$). Remarkably, we find striking differences in the low-mass (${M}_{\mathrm{truth}}^{\mathrm{cl}}\lesssim 35$) regime. The CNN centroid deviations gradually increase for decreasing masses, reaching ∼10 pixels at ${M}_{\mathrm{truth}}^{\mathrm{cl}}\sim 10$. On the other hand, the KS93 result shows many catastrophic errors (≳50 pixels) in this regime. This contrast is seen more clearly in Figure 8, where we directly compare the deviations for the same clusters.


Figure 7. Cluster centroid deviation (Δpeak) as a function of the truth cluster mass (${M}_{\mathrm{truth}}^{\mathrm{cl}}$). Both CNN and KS93 perform well for massive clusters (${M}_{\mathrm{truth}}^{\mathrm{cl}}\gtrsim 35$). However, the KS93 method produces many catastrophic errors for ${M}_{\mathrm{truth}}^{\mathrm{cl}}\lesssim 35$.


Figure 8. Comparison of the centroid errors between the KS93 and CNN results. The data points are color-coded with the truth mass. Many catastrophic errors are present in the KS93 results.


We attribute the large difference in the centroid deviations for low-mass clusters to the uncontrolled noise fluctuation in the KS93 mass reconstruction discussed in Section 3.1. As shown by the example in Figure 2, the highest convergence value is sometimes found outside the cluster region. Furthermore, even when the highest convergence value is not catastrophically far from the truth, the lack of contrast against the neighboring background substructures makes the centroid measurement highly uncertain.

3.5. Influence of Masking

Up to now, we have tested our CNN method under the assumption that no masked regions are present within the reconstruction field. In real observations, however, we need to mask out the regions affected by bright stars. Several methods have been suggested to mitigate the artifacts due to the missing information (e.g., Starck et al. 2003; Pires et al. 2009). In this paper, without taking any explicit measure to minimize the influence (i.e., we did not perform a separate training with masked galaxy catalogs), we simply investigated the impact of large masks on the mass-reconstruction performance of the same CNN model.

The expected number density of bright stars depends on the Galactic latitude, and the exact size of the mask for a star of a given magnitude varies with the specific reduction/analysis methods. Reviewing our previous WL studies with Subaru/Suprime-Cam imaging data, we find that within the typical $20^{\prime} \times 20^{\prime} $ WL analysis area, 1 ∼ 2 bright-star masks were needed, with masking radii ranging from $\sim 0\buildrel{\,\prime}\over{.} 5$ to $\sim 2^{\prime} $ (e.g., Finner et al. 2017; Kim et al. 2019; Yoon et al. 2020). To mimic such conditions, we applied bright-star masking with these number density and size distributions to our source catalogs. We ensured that every cluster has at least one mask near the mass peak because we are interested in the effect at its maximum.
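The masking procedure can be imitated with a short sketch. The function names and the uniform draw of mask positions and radii are our assumptions; the text specifies only that 1 ∼ 2 masks per field are used with radii between $0\buildrel{\,\prime}\over{.} 5$ and $2^{\prime}$:

```python
import numpy as np

def apply_star_masks(x, y, masks):
    """Drop sources inside circular bright-star masks.

    x, y  : source positions in arcmin
    masks : iterable of (cx, cy, radius) tuples, also in arcmin"""
    keep = np.ones(len(x), dtype=bool)
    for cx, cy, r in masks:
        keep &= (x - cx) ** 2 + (y - cy) ** 2 > r ** 2
    return x[keep], y[keep]

def random_star_masks(rng, field=20.0, n_min=1, n_max=2, r_min=0.5, r_max=2.0):
    """Draw 1-2 circular masks with 0'.5-2' radii anywhere in the field."""
    n = rng.integers(n_min, n_max + 1)
    return [(rng.uniform(0, field), rng.uniform(0, field),
             rng.uniform(r_min, r_max)) for _ in range(n)]
```

In our tests we additionally force one mask to land near the truth mass peak, which the random draw above does not enforce.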

Our visual inspection of the result shows that in most cases, the CNN method can still detect the cluster mass clumps in the presence of masks. In Figure 9 we display one such example. Although the central $r=2^{\prime} $ (∼0.73 Mpc) circular mask placed near the mass peak completely removes the source galaxies within it, the reconstruction still reveals the cluster nearly at the truth position. However, we find that because of the missing data, the convergence values are slightly underestimated.


Figure 9. Impact of the bright-star masking on the CNN mass reconstruction. We use the same cluster as presented in Figure 2 as an example. The middle (right) panel shows the reconstruction without (with) bright-star masking. The color scale of the bottom panel is based on the normalized convergence ($\widetilde{\kappa }$) as in Figure 2. Dashed white circles mark the locations of the bright-star masks. Although the central mask crops out the most significant region of the cluster, the result (CNN-BS) shows that the cluster is still clearly detected near the truth position. We note that the missing information leads to a slight underestimation of the peak convergence values.


In order to examine the masking impact quantitatively, we measured the joint distribution, cluster mass comparison, and centroid distribution for the entire test sample as in Sections 3.2, 3.3, and 3.4, respectively. The joint κ distribution displayed in the left panel of Figure 10 clearly indicates that the correlation with the truth in the high-convergence regime is significantly weakened. Compared with the nonmasking case (right panel of Figure 4), the slope is reduced by a factor of 1.5 ∼ 2 at 0.2 ≲ κtruth ≲ 0.5, which is consistent with our expectation from the visual inspection of the convergence map.


Figure 10. CNN performance under the influence of bright-star masking. Left panel: joint probability measured from the convergence pixels within the masks. Middle panel: comparison of projected cluster mass estimates with the truth. Right panel: comparison of centroid deviations with the KS93 ones performed without bright-star masking.


Since this weakened correlation in κ is primarily due to the underestimation of the κ values within the mask placed near the cluster center, we can expect that the correlation in cluster mass also suffers in a similar fashion. The slope of the reconstructed mass to the truth becomes ${0.554}_{-0.211}^{+0.257}$ (see the middle panel of Figure 10), which is substantially smaller than the nonmasking case ${0.867}_{-0.296}^{+0.327}$.
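As stated in the Figure 6 caption, the quoted slope is the median of the per-cluster ratio ${M}_{\mathrm{pred}}^{\mathrm{cl}}/{M}_{\mathrm{truth}}^{\mathrm{cl}}$ together with its 68% interval. A minimal sketch of that statistic (function name ours):

```python
import numpy as np

def ratio_stats(m_pred, m_truth):
    """Median and 68% interval of the per-cluster ratio M_pred / M_truth."""
    r = np.asarray(m_pred, dtype=float) / np.asarray(m_truth, dtype=float)
    lo, med, hi = np.percentile(r, [16, 50, 84])
    return med, med - lo, hi - med
```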

Finally, in terms of the centroid deviation, we find that the fraction of the catastrophic errors increases because the convergence values within the masked region are underestimated, and this causes the largest convergence peak within the reconstructed field to sometimes lie outside the masked area. The right panel of Figure 10 displays the comparison of the centroid deviation with the KS reconstruction result performed without any masking. Even in the low-deviation regime (${{\rm{\Delta }}}_{\mathrm{peak}}^{\mathrm{KS}}\lesssim 10$ pixels), the CNN method sometimes produces catastrophic errors. As mentioned earlier, this happens because of the in-mask underestimation. However, interestingly, in the regime where KS produces catastrophic errors (${{\rm{\Delta }}}_{\mathrm{peak}}^{\mathrm{KS}}\gtrsim 30$ pixels), the CNN performance is sometimes significantly better. This may happen for the cases where the in-mask underestimation is less severe than the KS93 artifacts, including noise amplification and inadequate κ-scale recovery (see the discussion in Section 3.1).

4. Discussion

4.1. Why Does Our CNN Algorithm Outperform the KS93 Method?

The comparison of our CNN mass reconstruction with the KS93 result has shown that the CNN performance is significantly better in several aspects (Section 3). To name a few, the bias in the projected cluster mass estimation based on the convergence map is greatly reduced, and the fraction of catastrophic errors in the centroid measurement becomes much smaller, especially in the low-mass regime. Moreover, the convergence map is adaptively smoothed in such a way that a larger kernel is effectively used in regions where the lensing S/N is lower, which leads to efficient noise suppression in the cluster outskirts. Here we discuss the reasons for this outperformance.

The main cause for the improvement can be understood if we review some of the key issues in the conventional mass reconstruction (Section 2.2). The mass-sheet degeneracy is the most fundamental problem because the observable reduced shear remains unchanged under the transformation of the convergence field: κ → λκ + (1 − λ). This degeneracy can be lifted only by imposing some specific κ value somewhere in the reconstruction field. One reasonable assumption for a wide-field mass reconstruction is that the mean convergence is close to zero (although it is not exactly zero) near the field boundary. This allows us to determine the λ value and thus break the degeneracy. Because our training data sets are drawn from cosmological simulation data, we believe that our CNN learns to use this information.
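The invariance and its lifting can be checked numerically. Under κ → λκ + (1 − λ) with γ → λγ, the reduced shear g = γ/(1 − κ) is unchanged, while a prior on the mean boundary convergence pins down λ. The toy fields and names below are our own construction:

```python
import numpy as np

rng = np.random.default_rng(0)
kappa = 0.3 * rng.random((64, 64))            # toy convergence field
gamma = 0.1 * (rng.random((64, 64)) - 0.5)    # toy shear field (one component)

lam = 0.7
kappa_t = lam * kappa + (1.0 - lam)           # mass-sheet transformation
gamma_t = lam * gamma

# The observable reduced shear g = gamma / (1 - kappa) is invariant:
g = gamma / (1.0 - kappa)
g_t = gamma_t / (1.0 - kappa_t)
assert np.allclose(g, g_t)

# A prior on the mean boundary convergence lifts the degeneracy. Here we use
# the (known) toy value; in practice one assumes it is close to zero.
border_t = np.concatenate([kappa_t[0], kappa_t[-1], kappa_t[1:-1, 0], kappa_t[1:-1, -1]])
kbar_true = np.concatenate([kappa[0], kappa[-1], kappa[1:-1, 0], kappa[1:-1, -1]]).mean()
lam_rec = (1.0 - border_t.mean()) / (1.0 - kbar_true)
```

The recovered `lam_rec` equals the applied λ, so fixing the boundary mean determines the full κ scale.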

Another critical issue is the nonlinearity in the g–κ mapping. While the average ellipticity $\left\langle e\right\rangle $ measures the reduced shear g = γ/(1 − κ), the convergence is a function of the shear γ (Equation (11)). This distinction is ignored in the original KS93 formalism under the assumption that g ≈ γ (i.e., κ ≪ 1) in the very weak gravitational-lensing regime. Obviously, the condition κ ≪ 1 is invalid in the typical cluster environment. Several suggestions exist in the literature to implement the nonlinearity. For example, Seitz & Schneider (1996) suggest an iterative procedure that updates γ in Equation (11) with the κ estimate from the previous step. One drawback of this approach is noise amplification through the iterations. Therefore, some authors propose maximum-likelihood-based methods with regularization constraints (e.g., Seitz et al. 1998; Bradač et al. 2004; Jee et al. 2007). However, the fundamental limitation is that one needs κ on an absolute scale in order to correctly address the nonlinearity g = γ/(1 − κ); that is, the nonlinearity problem cannot be addressed in isolation from the mass-sheet degeneracy. In CNN-based deep-learning algorithms, such nonlinear problems are routinely addressed, and many applications have proven promising (see, e.g., McCann et al. 2017; Rivenson et al. 2017; Lucas et al. 2018, and references therein). Indeed, the development of CNN algorithms has been motivated by nonlinear problems such as denoising, image restoration, deconvolution, super-resolution, medical image reconstruction, and holographic image reconstruction. Therefore, it is not surprising that, combined with the mass-sheet degeneracy-lifting capability, our CNN mass reconstruction significantly outperforms the original KS93 method.
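The Seitz & Schneider (1996)-type iteration can be sketched with a minimal FFT-based, Kaiser-Squires-style inversion. This is our own simplified implementation, assuming periodic boundaries and omitting the noise regularization discussed above:

```python
import numpy as np

def ks_invert(gamma1, gamma2):
    """Linear Kaiser-Squires-style inversion in Fourier space (periodic grid)."""
    ny, nx = gamma1.shape
    k1 = np.fft.fftfreq(nx)[np.newaxis, :]
    k2 = np.fft.fftfreq(ny)[:, np.newaxis]
    ksq = k1 ** 2 + k2 ** 2
    ksq[0, 0] = 1.0                  # avoid 0/0; the k=0 mode is set below
    g1h, g2h = np.fft.fft2(gamma1), np.fft.fft2(gamma2)
    kh = ((k1 ** 2 - k2 ** 2) * g1h + 2.0 * k1 * k2 * g2h) / ksq
    kh[0, 0] = 0.0                   # mass-sheet degeneracy: mean kappa unconstrained
    return np.fft.ifft2(kh).real

def iterative_inversion(g1, g2, n_iter=30):
    """Iterate kappa -> KS(g * (1 - kappa)) to account for g = gamma/(1 - kappa)."""
    kappa = np.zeros_like(g1)
    for _ in range(n_iter):
        kappa = ks_invert(g1 * (1.0 - kappa), g2 * (1.0 - kappa))
    return kappa
```

Each pass converts the observed reduced shear back to an estimate of γ using the previous κ, then re-inverts; in noisy data this loop is exactly where the noise amplification noted above arises.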

4.2. Test with Real Observations: Application to the El Gordo Cluster Data

We have demonstrated that our CNN method can successfully reconstruct the projected mass maps from WL galaxy shears, with an overall performance significantly better than that of the classical KS93 algorithm. An important remaining question is how well the current CNN method would work on real observational data, where a number of additional issues such as shear calibration errors, instrumental signatures, and photometric redshift systematics are present. With further development in deep-learning and astronomical image generation tools, these issues may become tractable through end-to-end WL simulations in the future. Here we apply our CNN method to the HST WL data for the high-redshift merging cluster "El Gordo" (Jee et al. 2014; Menanteau et al. 2014). Within the current scope, we are interested in how our CNN method performs given the differences between the training data set and the real data in the following three aspects. First, our training was performed with a specific $32^{\prime} \times 32^{\prime} $ field size, whereas the HST field ($\sim 9^{\prime} \times 9^{\prime} $) of the El Gordo data is several times smaller. Second, the training assumes that every source galaxy lies at an identical source redshift. Obviously, the source population in the El Gordo field spans a wide range of redshifts, and more importantly, it contains a significant fraction of nonbackground (contamination from cluster members and foreground objects) galaxies. Third, the source density in the training data set is 25 per sq. arcmin, approximately a factor of four lower than the source density (∼100 per sq. arcmin) in the HST observation of El Gordo.

Our HST catalog for El Gordo is provided by Kim et al. (2021), who studied the cluster with a new wide-field HST imaging data set (PROP ID: 14153; PI: Hughes). We refer to Kim et al. (2021) for details of the observation setup and reduction methods. In brief, the cluster was observed in four different programs (PROP IDs: 12477, 12755, 14096, and 14153). The entire FOV of the data with the addition of the last program (PROP ID: 14153) is ∼119 sq. arcmin, which covers the cluster beyond the virial radius r200 ∼ 2 Mpc. With the combination of all existing programs, the resulting average source density is ∼95 per sq. arcmin.

Figure 11 displays the reconstructed mass map of the El Gordo cluster from our CNN model. The comparison with the KS93 version indicates that the advantages of the CNN method demonstrated in Section 3 with the simulated catalogs also manifest themselves here. First, the dynamic range of κ is more realistic in the CNN version. El Gordo is one of the most massive clusters known, and the projected convergence value should be κ ≳ 0.4 in the central region based on the effective source redshift of ∼1.2 (Kim et al. 2021) and the cluster redshift of 0.87. The range of the convergence values in the CNN mass map is consistent with this expectation, although the relatively small field size does not allow us to lift the mass-sheet degeneracy completely. On the other hand, the peak convergence value in the KS93 case is too low. Second, the KS93 inversion generates a number of spurious features in the outskirts, whereas the CNN mass reconstruction efficiently suppresses these fluctuations. Our multiwavelength data from X-ray to radio do not support the possibility that the features seen in the KS93 map are real. Third, the two mass peaks are better resolved in the CNN mass reconstruction. In the KS93 mass map, although one can see the presence of the two mass peaks, a bridge exists that connects them. Again, the existence of such a connecting substructure is not supported by our data.


Figure 11. Application of our CNN mass reconstruction to the El Gordo cluster. We use the HST WL catalog of Kim et al. (2021). The left panel shows the mass reconstruction based on the KS93 algorithm. The CNN mass map in the middle panel is the average of the results from our 10 independent runs, which are also used to estimate the standard deviation shown in the right panel. Despite the differences between the training data sets and the El Gordo data, the CNN algorithm significantly outperforms the KS93 method in terms of the dynamical range representation, noise suppression, and substructure resolution.


4.3. Null Test

Although we take measures to prevent our CNN from learning that a cluster is always at the field center, our subsampling scheme for the generation of the training data set described in Section 2.3.2 still places the cluster always within the central ∼8.4% of the field (∼30% in each dimension). Therefore, our CNN model constructed from this training data set is expected to overestimate the convergence in the central region.

In order to quantify the bias, we performed a null test by generating 1000 random galaxy catalogs for the null (κ = 0) field and reconstructing the corresponding convergence fields with both the CNN and KS93 methods. We measured the projected mass from each convergence map using the r = 10 pixel circular aperture placed at the field center. Figure 12 compares the distributions of the masses from CNN and KS93. The KS93 result shows that the distribution is roughly symmetric around zero. On the other hand, our CNN-based mass clearly shows positive skewness. This null test demonstrates that our CNN leads to an overestimation of the convergence in the central region. Although not shown in Figure 12, we verify that the bias gradually decreases as we move the location of the aperture toward the edges.
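The null-test procedure can be sketched as follows. For a self-contained, runnable example we substitute a simple Gaussian smoothing of the noise field for the reconstruction step (names ours); any such linear reconstruction of pure shape noise should produce a roughly symmetric aperture-mass distribution, which is the behavior the KS93 result exhibits:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def null_test_masses(reconstruct, n_real=300, n_side=64, sigma_e=0.25,
                     r_pix=10, seed=42):
    """Aperture masses at the field center for pure shape-noise realizations."""
    rng = np.random.default_rng(seed)
    y, x = np.mgrid[0:n_side, 0:n_side]
    aper = (x - n_side // 2) ** 2 + (y - n_side // 2) ** 2 <= r_pix ** 2
    masses = []
    for _ in range(n_real):
        g1 = rng.normal(0.0, sigma_e, (n_side, n_side))
        g2 = rng.normal(0.0, sigma_e, (n_side, n_side))
        masses.append(reconstruct(g1, g2)[aper].sum())
    return np.asarray(masses)

# Stand-in "reconstruction": smooth one shear component (linear, hence symmetric).
smooth = lambda g1, g2: gaussian_filter(g1, sigma=2.0)
masses = null_test_masses(smooth)
```

Replacing `smooth` with the trained CNN would reproduce the positively skewed distribution described above.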


Figure 12. Cluster mass null test. We performed this null test by generating 1000 random galaxy catalogs for the null (κ = 0) field and reconstructing the corresponding convergence fields with both the CNN and KS93 methods. We measured the projected mass within the r = 10 pixel (∼0.72 Mpc) circular aperture placed at the field center. While the KS93 distribution is roughly symmetric around zero, our CNN mass clearly shows positive skewness. However, even the cluster mass at the high-end tail (${M}_{\mathrm{pred}}^{\mathrm{cl}}\sim 10$) corresponds to an insignificant cluster mass of ${M}_{200}\sim 7\times {10}^{13}\,{M}_{\odot }$, which is below the typical WL detection limit.


We stress that this level of bias is insignificant for individual cluster mass estimation. For example, the projected mass at the high-end tail of the CNN distribution (${M}_{\mathrm{pred}}^{\mathrm{cl}}\sim 10$) corresponds to a cluster mass of ${M}_{200}\sim 7\times {10}^{13}\,{M}_{\odot }$, which is below the detection limit of typical ground-based WL studies. Nevertheless, we believe that future studies can reduce the bias substantially by improving the subsampling method and/or including blank fields in the training data set.

5. Conclusion

In this paper, we have introduced a new WL mass-reconstruction method based on CNN algorithms. Our CNN architecture consists of a series of 2D convolution and transposed-convolution layers, with skip-connections between the input and the output of each convolution/transposed-convolution pair implemented via a multiplication operation. We generate training data sets using ray-tracing data from cosmological simulations, while the statistical properties of the source galaxies are designed to match those in our typical WL studies with Subaru/Suprime-Cam images.
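A toy, single-channel numpy rendering of one such block may clarify the structure. The actual model is a multichannel Keras network; the kernel contents and activation placement here are illustrative, and we use the fact that, with stride 1 and "same" padding, a transposed convolution reduces to a convolution with the flipped kernel:

```python
import numpy as np
from scipy.signal import convolve2d

def conv_tconv_block(x, k_conv, k_tconv):
    """One convolution / transposed-convolution pair with a multiplicative skip."""
    h = np.maximum(convolve2d(x, k_conv, mode="same"), 0.0)  # convolution + ReLU
    h = convolve2d(h, k_tconv[::-1, ::-1], mode="same")      # transposed convolution
    return x * h                                             # skip via multiplication
```

With identity (delta) kernels, the block returns `x * relu(x)`, making the role of the multiplicative skip explicit.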

Compared with the original KS93 inversion, our CNN method produces significantly improved results. The merits include better restoration of the dynamic range, closer agreement with the truth in both pixel-by-pixel and cluster-mass comparisons, more robust centroid determination, and stronger noise suppression. In particular, it is remarkable that the slope of the recovered mass to the truth becomes consistent (${0.867}_{-0.296}^{+0.327}$) with unity for the test sample. The slope is much lower (${0.484}_{-0.149}^{+0.218}$) when we use the KS93 results instead. Furthermore, we find that the centroid estimation based on the CNN result is much more stable in the low-mass regime. We attribute these improvements to our CNN algorithm's efficient handling of the nonlinearity and degeneracy that have long plagued traditional mass-reconstruction methods.

The performance of our CNN algorithm somewhat degrades when a bright-star mask happens to fall near the cluster center. Nevertheless, we find that the CNN reconstruction can still recover the cluster mass peak in most cases, and the overall performance remains better than that of KS93.

We tested our CNN model using the HST WL catalog of the El Gordo cluster. Despite the difference between our training data set and the real data in field size, source density, and redshift distribution, the CNN method clearly resolves the two mass clumps of the merging cluster in excellent agreement with the cluster member distribution while suppressing the noise fluctuation in the outskirts.

Our study is the first implementation of a WL mass reconstruction with CNN methods. Although further refinements in both algorithm and simulation are needed before the method can be used for a quantitative characterization of galaxy clusters, the results from this pilot study look promising. One immediate application that does not require such refinements is shear-based galaxy cluster detection. Among the various selection methods, shear-based selection is unique in its ability to detect galaxy clusters by their projected masses. However, one of the most outstanding obstacles is the control of false positives due to noise fluctuation. As demonstrated throughout the paper, our CNN method suppresses this noise fluctuation efficiently while preserving the resolution in the high-density regions, where the shear signal is high. In addition, because the projected κ values are useful mass proxies, the CNN method can provide a first classification of clusters according to their masses. Finally, the CNN-aided centroid determination and its comparison with other multiwavelength data can enhance substructure identification and the reconstruction of merging scenarios in colliding clusters.

The authors thank Inkyu Park, Cristiano Sabiu, David Parkinson, and Min-su Shin for helpful discussions. S.E.H., S.P., and D.B. were (partly) supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2018R1A6A1A06024977). S.E.H. was also partly supported by the project 우주거대구조를 이용한 암흑우주연구 ("Understanding the Dark Universe Using Large Scale Structure of the Universe"), funded by the Ministry of Science. M.J.J. acknowledges support from the National Research Foundation of Korea under the program nos. 2017R1A2B2004644 and 2017R1A4A1015178. D.B. was also supported by NRF grant No. 2020R1A2B5B01001473. This work was also supported by the Korean Astronomy Machine Learning (KAML) working group.

This work is based on observations made with the NASA/ESA Hubble Space Telescope, operated by the Association of Universities for Research in Astronomy, Inc. under NASA contract NAS 5-26555. Computational data were transferred through a high-speed network provided by the Korea Research Environment Open NETwork (KREONET).

We thank the Columbia Lensing group (http://columbialensing.org) for making their simulations available. The creation of these simulations is supported through grants NSF AST-1210877, NSF AST-140041, and NASA ATP-80NSSC18K1093. We thank the New Mexico State University (USA) and the Instituto de Astrofisica de Andalucia CSIC (Spain) for hosting the Skies & Universes site for cosmological simulation products.

Facility: HST (ACS).

Software: Astropy (Astropy Collaboration et al. 2013), Keras (Chollet et al. 2015), Matplotlib (Hunter 2007), NumPy/SciPy (Virtanen et al. 2020), Pandas (McKinney 2010), SExtractor (Bertin & Arnouts 1996), Tensorflow (Abadi et al. 2015).

Appendix: Performance Test with Other CNN Architectures

We present performance tests of various CNN architectures using the diagnostics given in Section 3. Hereafter, the architecture presented in the main text of the current paper is referred to as FocalLoss. It uses the modified mean-square-error loss function inspired by the focal loss (Lin et al. 2017; see Equation (12)).

In addition to FocalLoss, we test the following five variations:

  • Fiducial: same as FocalLoss, except that the loss function is given by the standard MSE,
    $\mathrm{MSE}=\left\langle {({\widetilde{\kappa }}_{\mathrm{pred}}-{\widetilde{\kappa }}_{\mathrm{truth}})}^{2}\right\rangle $ (Equation (17)),
    where the average runs over pixels. This MSE is also used for the rest of the variations.
  • 4Channel: same as Fiducial, except that the smoothed number distribution of background galaxies is used as an additional channel of the input layer.
  • 19Filter: same as Fiducial, except that the employed filter size is 19 × 19 (instead of 49 × 49) during the convolution and transposed-convolution operations.
  • 29Filter: same as Fiducial, except that the employed filter size is 29 × 29.
  • NoSkip: same as Fiducial, except that it uses no skip-connection.
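The two loss options can be contrasted in a small numpy sketch. The exact form of the FocalLoss weighting (Equation (12) of the main text) is not reproduced here, so the residual-power weighting below is only an illustrative stand-in:

```python
import numpy as np

def mse_loss(kappa_pred, kappa_truth):
    """Standard mean-square error (the Fiducial loss, Equation (17))."""
    return np.mean((kappa_pred - kappa_truth) ** 2)

def focal_mse_loss(kappa_pred, kappa_truth, gamma_f=1.0):
    """A focal-inspired weighted MSE: up-weight pixels with large residuals.

    This |residual|**gamma_f weighting is only illustrative; the actual
    FocalLoss form is given by Equation (12) of the main text."""
    err = kappa_pred - kappa_truth
    return np.mean(np.abs(err) ** gamma_f * err ** 2)
```

The shared idea, following Lin et al. (2017), is to concentrate the training signal on hard (large-residual) pixels rather than weighting all pixels equally.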

Figure 13 compares the mass reconstructions from these various CNN architectures. Judging from the visual inspection, most CNN variations produce similar results. The exception is NoSkip, whose resolution is substantially compromised compared to the others. When we examined the results from the individual runs, the NoSkip runs frequently produce null results, where the convergence map is flat ($\kappa ({\boldsymbol{x}})\approx \mathrm{constant}$). Moreover, the non-null results lack small-scale structures. This comparison illustrates that the ResNet-like skip-connection between convolution and transposed-convolution layers plays a crucial role in recovering the details. The 4Channel result does not show any significant merit over Fiducial. This indicates that the additional information about the source number distribution does not meaningfully contribute to the mass-reconstruction quality. In this example and in others as well, we find that the 19Filter results tend to slightly overestimate the densities near the field edges compared to those produced with the larger (29 × 29 or 49 × 49) filters, although the difference becomes insignificant when we compare the 29 × 29 versus 49 × 49 cases. This implies that there may exist a lower threshold in filter size in order to properly restore the dynamic range. Tables 3 and 4 summarize the comparisons among different CNN branches.


Figure 13. Comparison of mass-reconstruction results from different CNN architectures. See text for the details of each variation. Most CNN variations show similar performances, except for NoSkip, which suffers from significant resolution loss. According to our quantitative comparison based on the entire test sample, FocalLoss (used in the main text) produces the best results.


Table 3. Same as Table 2, Including Various CNN Architectures

Model ${ \mathcal D }({\widetilde{\kappa }}_{\mathrm{pred}},{\widetilde{\kappa }}_{\mathrm{truth}})$ ${M}_{\mathrm{pred}}^{\mathrm{cl}}/{M}_{\mathrm{truth}}^{\mathrm{cl}}$ Δpeak
KS93 6.26 ± 4.57 ${0.484}_{-0.149}^{+0.218}$ $4\buildrel{\,\prime}\over{.} {27}_{-4\buildrel{\,\prime}\over{.} 27}^{+8\buildrel{\,\prime}\over{.} 63}$
Fiducial 4.59 ± 4.12 ${0.762}_{-0.266}^{+0.305}$ ${\bf{0}}\buildrel{\,\prime}\over{.} {{\bf{55}}}_{-0\buildrel{\,\prime}\over{.} 36}^{+4\buildrel{\,\prime}\over{.} 19}$
4Channel $\underline{4.62\pm 4.50}$ ${0.789}_{-0.263}^{+0.289}$ $0\buildrel{\,\prime}\over{.} {59}_{-0\buildrel{\,\prime}\over{.} 37}^{+3\buildrel{\,\prime}\over{.} 66}$
19Filter 4.09 ± 4.32 $\underline{\Space{0ex}{1.0ex}{0ex}{0.720}_{-0.238}^{+0.314}}$ $\underline{0\buildrel{\,\prime}\over{.} {73}_{-0\buildrel{\,\prime}\over{.} 40}^{+1\buildrel{\,\prime}\over{.} 19}}$
29Filter 4.32 ± 4.16 ${0.754}_{-0.245}^{+0.305}$ $0\buildrel{\,\prime}\over{.} {67}_{-0\buildrel{\,\prime}\over{.} 44}^{+7\buildrel{\,\prime}\over{.} 31}$
FocalLoss 4.36 ± 3.77 ${\bf{0}}.{{\bf{867}}}_{-0.296}^{+0.327}$ $0\buildrel{\,\prime}\over{.} {60}_{-0\buildrel{\,\prime}\over{.} 38}^{+4\buildrel{\,\prime}\over{.} 92}$
NoSkip 5.04 ± 3.79 $-{1.329}_{-0.842}^{+0.457}$ $0\buildrel{\,\prime}\over{.} {73}_{-0\buildrel{\,\prime}\over{.} 40}^{+1\buildrel{\,\prime}\over{.} 19}$

Note. The CNN model used in the main text is FocalLoss, and see text for the definition of each architecture. The architecture with the best performance for each parameter is marked in bold characters, while the architecture with the worst performance for each parameter is underlined. KS93 and NoSkip are not underlined because they always show poorer performances than the other architectures.


Table 4. Same as Table 3, but for Bright-star Masking (see Section 3.5)

Model ${ \mathcal D }({\widetilde{\kappa }}_{\mathrm{pred}},{\widetilde{\kappa }}_{\mathrm{truth}})$ ${M}_{\mathrm{pred}}^{\mathrm{cl}}/{M}_{\mathrm{truth}}^{\mathrm{cl}}$ Δpeak
Fiducial-BS 4.22 ± 3.50 ${0.448}_{-0.182}^{+0.217}$ $2\buildrel{\,\prime}\over{.} {60}_{-2\buildrel{\,\prime}\over{.} 12}^{+4\buildrel{\,\prime}\over{.} 09}$
4Channel-BS $\underline{4.93\pm 3.93}$ ${0.400}_{-0.135}^{+0.178}$ $4\buildrel{\,\prime}\over{.} {20}_{-3\buildrel{\,\prime}\over{.} 67}^{+4\buildrel{\,\prime}\over{.} 48}$
19Filter-BS 4.91 ± 3.90 $\underline{\Space{0ex}{1.0ex}{0ex}{0.386}_{-0.149}^{+0.191}}$ $\underline{6\buildrel{\,\prime}\over{.} {05}_{-5\buildrel{\,\prime}\over{.} 48}^{+4\buildrel{\,\prime}\over{.} 99}}$
29Filter-BS 4.29 ± 3.55 ${0.421}_{-0.193}^{+0.188}$ $2\buildrel{\,\prime}\over{.} {55}_{-2\buildrel{\,\prime}\over{.} 11}^{+6\buildrel{\,\prime}\over{.} 09}$
FocalLoss-BS 3.66 ± 3.22 ${\bf{0}}.{{\bf{554}}}_{-0.211}^{+0.257}$ ${\bf{1}}\buildrel{\,\prime}\over{.} {{\bf{53}}}_{-1\buildrel{\,\prime}\over{.} 14}^{+4\buildrel{\,\prime}\over{.} 21}$

Note. NoSkip is not shown because of its poor performance.


Footnotes

  • 6  
  • 7 That is, one can apply the procedure to real observations.
  • 8 The exact value depends on the halo profile shape. Here we assume an NFW profile with a concentration of c = 3.5 and a scale radius of rs = 200 kpc at z = 0.5.
