Research Article | Open Access

The Generation of Visually Credible Adversarial Examples with Genetic Algorithms

Published: 29 March 2023


Abstract

An adversarial example is an input that a neural network misclassifies although the input differs only slightly from an input that the network classifies correctly. Adversarial examples are used to augment neural network training data, measure the vulnerability of neural networks, and provide intuitive interpretations of neural network output that humans can understand. Although adversarial examples are defined in the literature as similar to authentic input from the perspective of humans, the literature measures similarity with mathematical norms that are not scientifically correlated with human perception. Our main contributions are to construct a genetic algorithm (GA) that generates adversarial examples more similar to authentic input than do existing methods and to demonstrate with a survey that humans perceive those adversarial examples to have greater visual similarity than existing methods. The GA incorporates a neural network, and we test many parameter sets to determine which fitness function, selection operator, mutation operator, and neural network generate adversarial examples most visually similar to authentic input. We establish which mathematical norms are most correlated with human perception, which permits future research to incorporate the human perspective without testing many norms or conducting intensive surveys with human subjects. We also document a tradeoff between speed and quality in adversarial examples generated by GAs and existing methods. Although existing adversarial methods are faster, a GA provides higher-quality adversarial examples in terms of visual similarity and feasibility of adversarial examples. We apply the GA to the Modified National Institute of Standards and Technology (MNIST) and Canadian Institute for Advanced Research (CIFAR-10) datasets.


1 INTRODUCTION

Szegedy et al. [2014] coined the term Adversarial Example (AE) for an input that a neural network misclassifies although it differs only slightly from an input that the network classifies correctly. Ideally, a neural network’s classification of an input instance would not change abruptly with a slight modification of the input; networks whose classifications frequently do change abruptly are less reliable for classifying authentic input and more vulnerable to being intentionally fooled by malevolent input. Such a network is described as lacking robustness. An AE is always defined relative to some unmanipulated, authentic input, which we call the reference input. Further, the difference in the classification of the two entities is always with respect to a particular neural network. An AE may be classified differently from the reference input by one neural network, but not by another. Formally stated, an optimal AE, \(x^{\prime *}\), is an input to a neural network (or other classification model) that minimizes the distance from the authentic reference input, \(x\), as measured by some \(p\)-norm while being classified differently from the reference input, \( \begin{equation*} x^{\prime *} = \arg \min _{w \in {\mathcal {S}} }{ \left\Vert x - w \right\Vert _p } \quad \mbox{s.t. } L\left(w \right) \ne L\left(x \right), \end{equation*} \)

where \({\mathcal {S}}\) is the space of feasible solutions, \(L \left(w \right)\) indicates the class to which a candidate solution \(w\) is assigned by the classification model, and \(L \left(x \right)\) is the ground truth label of the reference input \(x\). Here, the prime indicates that \(x^{\prime *}\) is an AE and the asterisk indicates it is an optimal one.
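
To make the definition concrete, the following minimal sketch (our illustration, not code from the article) checks whether a candidate input \(w\) is a feasible AE for a reference input \(x\) and measures its \(p\)-norm distance; the classify function is a stand-in for any classification model.

```python
import numpy as np

def p_norm_distance(x, w, p):
    """Distance between reference input x and candidate w under an L_p norm."""
    if p == np.inf:
        return np.max(np.abs(x - w))
    return np.sum(np.abs(x - w) ** p) ** (1.0 / p)

def is_adversarial(classify, w, true_label):
    """A candidate w is a feasible AE if the model assigns it a label that
    differs from the ground truth label of the reference input."""
    return classify(w) != true_label

# Usage sketch: among feasible candidates, the optimal AE minimizes the distance.
# best = min((w for w in candidates if is_adversarial(classify, w, true_label)),
#            key=lambda w: p_norm_distance(x, w, 2))
```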

Some of the literature defines the term AE with human perception as the gauge of the similarity between an input and its AE [Brown et al. 2017; Chen et al. 2020; Goodfellow et al. 2014; Tramèr et al. 2017; Vidnerová and Neruda 2020]. Their view of an AE is that humans must perceive little or no difference between it and a reference input. The extant research, nonetheless, measures the differences between authentic inputs and AEs with mathematical norms, specifically the \(L_1\), \(L_2\), and \(L_\infty\) norms (e.g., Madry et al. [2017]; Sharma and Chen [2018]). To our knowledge, no research assesses the visual similarity of reference inputs and potential AEs based on direct human perception despite its frequently being invoked in the definition of an AE.

Human perception of the difference between an authentic input and its AE may not always be important, for example, where neural networks programmatically receive automated input and the decisions constituted by their classifications are perceived to have little or no impact on humans. In other circumstances, human interaction with input data and the output from Artificial Intelligence (AI) methods is needed to ensure equity. Growing concern over bias in algorithms and AI methodologies and their lack of transparency have motivated calls for human oversight of AI in criminal sentencing, hiring, and parole [Gibson-Rankin 2021; Kirkpatrick 2017; Kuncel et al. 2014; Mann and O’Neil 2016; Spielkamp 2017]. One such approach is for a human explainer to provide logical interpretations of AI output by comparing it to other outputs and their respective inputs [Wilson and Daugherty 2018], where the inputs might be real or fabricated. Equitable treatment can be documented where outputs are the same for similar real inputs. The fabricated inputs are called counterfactuals, as studied by Wachter et al. [2018] and Sharma et al. [2019]. They develop counterfactual inputs that demonstrate what particular and significant differences are required for the classification of an input to change, which provide tangible and intuitive explanations for those affected by the outcomes, nontechnical managers, administrators, and other consumers of AI. Therefore, counterfactuals are akin to AEs because both ideally reflect the minimal change to an input that results in a different classification.

In other circumstances, human interaction with input data and the output from AI methods is required by law. The European Union’s General Data Protection Regulation (EU) 2016/679 [European Parliament and Council of the European Union] gives any person who is significantly affected by an automated decision a legal “right to explanation” from a human and a means to contest the decision [Casey et al. 2019]. The California Consumer Privacy Act (CCPA) is patterned after the General Data Protection Regulation but does not apparently provide for a right to explanation, although it may be a harbinger of further laws requiring people to interpret inputs and explain outcomes. Whether motivated by ethics or required by law, promoting consistency with explainers or counterfactuals is important where decisions have a significant effect on people, such as in decisions regarding parole, loans, and hiring. As a consequence, it is important to incorporate human perception of AEs and counterfactuals into the literature.

The first research question we address is whether a Genetic Algorithm (GA) can generate AEs and, furthermore, whether those AEs are perceived by humans to have greater visual similarity with an authentic input than those generated with existing methods. We gauge the visual similarity of AEs by surveying human subjects. We are the first, to our knowledge, to evaluate visual similarity of AEs empirically from the human perspective. This survey data enables us to compare the performance of GAs and existing methods for generating AEs and also identify the best fitness functions, selection operators, and mutation operators for generating visually similar AEs with a GA. Specifically, we identify the fitness function that is most highly correlated with the human subjects’ perception of visual similarity, which can be redeployed in future research to incorporate human perspective. We also demonstrate that GAs can be used in complementary fashion with another AI technique, namely neural networks.

Our GA has features that we hypothesized would permit the generation of AEs similar to the reference input. First, the GA maintains a population of feasible AEs, which means that the reference neural network affixes labels that are different from the ground truth label of the reference input. Existing methods for generating AEs employ gradients of the reference neural network and include the Fast Gradient Sign Method (FGSM) [Goodfellow et al. 2014] and the Projected Gradient Descent (PGD) method [Madry et al. 2017]. These methods perturb the reference input by taking a step (or steps) away from the reference input toward an input that the neural network would classify differently. The FGSM and PGD methods take one or a fixed number of steps, respectively, but they do so without checking whether the reference neural network’s classification changes. Therefore, gradient methods do not guarantee valid AEs but prioritize speed over validity by avoiding the expensive prediction computation.

Gradient methods also do not prioritize minimizing the differences between a candidate AE and the reference input. The step size and the number of steps are fixed and deterministic. It is possible that a smaller step with the FGSM or fewer or smaller steps with the PGD method would produce a valid AE more similar to the reference input. However, dynamically adapting the size and number of steps taken depending on whether the classification changes requires classifying the perturbed input with the neural network, possibly many times. The implication of avoiding this classification computation is that gradient methods, again, prioritize speed rather than similarity. Indeed, research has broadly prioritized speed [Goodfellow et al. 2014; Wong et al. 2020; Zhang et al. 2021]. GAs, conversely, prioritize minimizing the difference between the best-fit AE and the reference input by using an appropriate fitness function and selection operator to cause increasing fitness across a population and generally in the best-fit member.

The GA guarantees valid AEs that, with appropriate parameters, approximate the minimum perturbation required to change the classification of a reference input. Regarding the former attribute, valid AEs are essential for effectively explaining neural network output with counterfactuals: the neural network must actually assign a different classification to the counterfactual for the difference in inputs to indicate what is required for a decision to change. Measuring neural network robustness requires both a minimal difference and validity. If an AE is not close to a reference input, then we have a poor estimate of a network’s sensitivity. If the candidate AE has the same classification as the reference input, then we cannot use it to measure robustness. The downside of GAs relative to gradient methods is their slower speed.

We generate AEs for images from two datasets, the first of which is the Modified National Institute of Standards and Technology (MNIST) dataset [Deng 2012], which can be downloaded from the MNIST database ( http://yann.lecun.com/exdb/mnist/) [LeCun et al. 2020]. MNIST images are grayscale images of handwritten digits from 0 to 9. Using the well-researched MNIST dataset permits us to compare our GA against other methods for generating AEs and also to use existing neural network architectures for the reference neural network in our GA. We also generated AEs for the Canadian Institute for Advanced Research (CIFAR-10) dataset [Krizhevsky et al. 2022], which is similarly well studied but which contains more complex images than does MNIST.

The rest of the article is organized as follows. We begin by providing a brief overview of the literature on AEs in Section 2. In Section 3, we provide an overview of the MNIST dataset and describe our GA for it. In Section 4, we describe existing gradient methods for generating AEs whose performance we compare with our algorithm. In Section 5, we describe our survey for assessing human perception of visual similarity. In Section 6, we analyze the survey data regarding MNIST reference images and their AEs. In Sections 7 and 8, we describe the GA for the CIFAR-10 dataset and analyze the survey results, respectively. We conclude in Section 9.


2 RELATED LITERATURE

AEs are used to measure the robustness of neural networks [Bastani et al. 2016; Hendrycks and Dietterich 2019; Zhang et al. 2018], to mount and defend against adversarial attacks [Brendel et al. 2018; Carlini and Wagner 2017; Chen et al. 2020], and for adversarial training that seeks to reduce the susceptibility of neural networks to adversarial attacks by augmenting the training set with AEs [Kurakin et al. 2017; Madry et al. 2017]. More recently, AEs and counterfactuals have been used to support intuitive explanations of neural network output. Note that the first three of these purposes are security related, whereas the last is for administering to those affected by neural networks. We will discuss later how these different purposes might be best satisfied by different approaches for generating AEs.

Prior research has offered readers visual comparisons of a small number of authentic inputs and their AEs (e.g., Madry et al. [2017]) and reported mathematical distances between reference inputs and AEs, but to our knowledge, visual similarity has not been assessed with human subjects. Sharma and Chen [2018] express concern regarding visual similarity, specifically calling into question the use of the infinity norm as used in the work of Madry et al. [2017], and they use a neural network with elastic-net regularization with the goal of improving visual similarity but still do not evaluate visual similarity with human subjects.

The mathematical distances between AEs and their reference inputs have been measured most often with either the \(L_1\), \(L_2\), or \(L_\infty\) norms (e.g., Madry et al. [2017], Sharma and Chen [2018]). The \(L_\infty\) and \(L_2\) norms have also been used to constrain the maximum perturbation of pixel values in an AE from corresponding pixel values in the reference input (e.g., Tramèr et al. [2017], Goodfellow et al. [2014]). Wachter et al. [2018] have also introduced a function whose intent is to reduce the difference that humans perceive between reference inputs and AEs. These four functions are the basis for the fitness functions we evaluate.

Wachter et al. [2018] use a neural network to generate counterfactuals for intuitively explaining neural network output to avoid technical explanations that are not accessible to non-technical people affected by the output. Counterfactual-based explanations also mitigate the need to divulge details of neural networks, which may be valuable intellectual property. Given the close association between counterfactuals and AEs, our work is related in that we also provide an automatic means for generating AEs, but we do so with a GA rather than a neural network. We do so, moreover, for a different type of data.

GAs have been used in only a few instances to generate AEs, but never with a focus on visual similarity, integration with neural networks, and the optimization of algorithm parameters. Vidnerová and Neruda [2020] used GAs to generate AEs for neural networks and other machine learning techniques, such as support vector machines and decision trees. Their focus is on showing the general applicability of GAs across machine learning methods rather than on visual similarity, which is our focus. GAs have also been employed to generate AEs in other contexts, including natural language and audio inputs [Alzantot et al. 2018a, 2018b; Taori et al. 2019; Yang et al. 2019], but the contexts of these works are significantly different from ours and the GAs, accordingly, differ significantly from ours.

GAs provide heuristic solutions to optimization problems by mimicking the evolution of a population of candidate solutions (phenotypes) over many generations. Portions of candidate solutions in one generation of a GA are combined in their offspring, who populate the next generation, just as the genetic material of the parents of living organisms is recombined. The quality of candidate solutions is measured by a fitness function that improves through the generations, although not always monotonically, consistent with “survival of the fittest.” The mechanisms driving this improvement are how parents are selected for mating (selection), how parents’ genes are recombined (crossover), and the mutation of genes within the resulting phenotypes. GAs have been applied to a broad set of problems, particularly to complex problems where classical optimization methodologies are not effective. We were motivated to generate AEs with GAs because that problem is reasonably complex, and candidate solutions are easily represented as phenotypes that are amenable to crossover and mutation. Pioneering work in GAs was done by Holland [1975], and the subsequent literature is vast. We defer a more detailed review of this literature until it can be brought to bear directly on our GA structure and its attributes.


3 GA FOR THE MNIST DATASET

3.1 MNIST Dataset

We investigate the generation of AEs for the MNIST dataset because it is well studied and methods for generating AEs for MNIST data are therefore available as a benchmark for our method. The MNIST dataset contains images of handwritten numerical digits from 0 to 9 [Deng 2012; LeCun et al. 2020], for which classification models are often created with the goal of classifying the images according to their ground truth labels. The dataset contains 60,000 training images and 10,000 test images. Neural networks are typically trained on the former set, and the accuracy of the resulting model is tested on the latter set. Each MNIST image consists of 784 pixels, which are arranged in a \(28 \times 28\) array when viewed as shown in Figure 1. Each pixel is encoded on a grayscale using integers 0 through 255, although these integer quantities are most often converted to floating-point values on \(\lbrack 0.0, 1.0 \rbrack\) to improve the training of neural networks, as we do.

Fig. 1. A sample image from the MNIST dataset.
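
As an illustration of the preprocessing described above (our sketch, not the authors' code), MNIST images can be loaded, scaled from integers in {0, ..., 255} to floats in \([0.0, 1.0]\), and flattened to 784-element vectors, assuming the tf.keras dataset loader is available:

```python
import numpy as np
import tensorflow as tf

# Load the 60,000 training and 10,000 test images with their ground truth labels.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Scale grayscale values to [0.0, 1.0] and flatten each 28x28 image to 784 pixels.
x_train = (x_train.astype(np.float32) / 255.0).reshape(-1, 784)
x_test = (x_test.astype(np.float32) / 255.0).reshape(-1, 784)
```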

3.2 A GA for Generating AEs

GAs evolve a population of candidate solutions over many generations by selecting parents from the population for reproduction based on their fitness in each generation, combining the parents into an offspring (crossover), and then (possibly) mutating the offspring. (Algorithm 1 presents a canonical representation of the steps in a GA.) For those less familiar with GAs, we recommend reviewing the work of Goldberg [1989] as a means to gain a foundation. The parameters governing how fitness is computed, how parents are selected, how mutation is performed, and how the feasibility of solutions is enforced vary by application. We describe in this section the specific algorithm parameters we used to operationalize Algorithm 1 and our motivation for those choices in the context of the MNIST dataset. The GA for the MNIST dataset is formally stated in Algorithm 2, and its notation is summarized in Table 1 and described in this and the subsequent sections.

Table 1. Notation

Notation | Description
\(\ell _p\) | Vector norm, where \(p \in \left\lbrace 1, 2, \infty \right\rbrace\)
\({\mathcal {D}}\) | Set of indices for MNIST images, \(\left\lbrace 1, \ldots, 60000 \right\rbrace\)
\({\mathcal {P}}\) | Set of pixel indices, \(\left\lbrace 1, \ldots, 784 \right\rbrace\)
\(N\) | Population size
\(G\) | Number of generations
\({\mathcal {C}}\) | Set of indices for candidate solutions in the population, \(\left\lbrace 1 , \ldots , N \right\rbrace\)
\(\mbox{MAD}_j\) | Median absolute difference of pixel \(j \in {\mathcal {P}}\) across MNIST from the median value of pixel \(j\)
\(\underline{\mbox{MAD}}\) | Lower limit on \(\mbox{MAD}_j \, \forall j \in {\mathcal {P}}\), imposed to ensure a finite distance \(d_{\texttt {MAD}} \left(\cdot \right)\)
\(\widetilde{\mbox{MAD}}_j\) | \(\mbox{MAD}_j\) after adjustment for \(\underline{\mbox{MAD}}\)
\(d_{\texttt {MAD}} \left(x , x^\prime \right)\) | Distance between images \(x\) and \(x^\prime\) normalized by MAD
\(x^i\) | MNIST image with index \(i \in {\mathcal {D}}\); \(x^i \in \lbrack 0.0 , 1.0 \rbrack ^{784}\)
\(x^i_j\) | Pixel \(j \in {\mathcal {P}}\) of MNIST image \(i \in {\mathcal {D}}\)
\(x^{k \prime }\) | AE \(k\) within the population, \(k \in {\mathcal {C}}\)
\(x^{\prime }_j\) | Pixel \(j \in {\mathcal {P}}\) of the best-fit or an arbitrary AE in the population
\(L \left(x \right)\) | The reference neural network's classification (label) of image \(x\)
\({\bf urv}\) | Uniform random variable on \(\lbrack 0.0, 1.0 \rbrack\)

We use \(x^i\) to represent the MNIST image with index \(i\) from the set of indices for the MNIST training set \({\mathcal {D}}\): \(i \in {\mathcal {D}} \equiv \left\lbrace 1 , \ldots , 60000 \right\rbrace\). We represent the native MNIST \(28 \times 28\) arrays as flattened arrays of 784 elements in the GA to facilitate simple crossover mechanisms. Thus, \(x^i \in \lbrack 0.0 , 1.0 \rbrack ^{784}\). The GA addresses one arbitrary MNIST image at a time, so we can often simplify exposition by suppressing the index and writing simply \(x\). The GA population has \(N\) members, and their indices are contained in the set \({\mathcal {C}}\). (We suppress notation for the generation number.) Each AE is written as \(x^{k \prime }\) for \(k \in {\mathcal {C}}\) with \(x^{k \prime } \in \lbrack 0.0 , 1.0 \rbrack ^{784}\). We define the best-fit AE from the population as \( \begin{equation*} x^\prime = \lbrace x^{k \prime } | f (x^{k \prime } ) \ge f (x^{m \prime } ), k \in {\mathcal {C}} , \forall m \in {\mathcal {C}} \rbrace , \end{equation*} \) where \(f\left(\cdot \right)\) is a fitness function. We refer to the pixel values using subscripts—for example, \(x_j\) and \(x^{\prime }_j\) for \(j \in {\mathcal {P}}\) for the reference MNIST image and the best-fit AE, respectively.

3.2.1 Fitness Functions.

We consider the three norms prevalent in the MNIST neural network literature and a function introduced by Wachter et al. [2018]. We also consider a second, linearized version of each of these based on the work of Mannino and Koushik [2000]. Eight fitness functions were considered in total.

The three norms frequently used in the MNIST neural network literature to measure distance between a MNIST image \(x\) and an AE, \(x^\prime\), are \(\begin{eqnarray*} \ell _1 \left(x , x^\prime \right) & = & \Vert x - x^\prime \Vert _1 \\ & = & \sum _{j \in {\mathcal {P}}}{ \left| x_j - x^\prime _j \right|} \\ \ell _2 \left(x , x^\prime \right) & = & \Vert x - x^\prime \Vert _2 \\ & = & \left\lbrack \sum _{j \in {\mathcal {P}}}{ \left(x_j - x^\prime _j \right)^2} \right\rbrack ^{\frac{1}{2}} \\ \ell _\infty \left(x , x^\prime \right) & = & \Vert x - x^\prime \Vert _\infty \\ & = & \max _{j \in {\mathcal {P}}}{ \left| x_j - x^\prime _j \right|} , \end{eqnarray*}\) where the subscript \(j\) indicates pixel indices, \(j \in {\mathcal {P}} \equiv \left\lbrace 1 , \ldots , 784 \right\rbrace\). Although \(\ell _\infty\) is used most often as a constraint on maximum pixel value difference between a MNIST image and its AE, we considered it as a plausible objective function because constraints in optimization problems are sometimes absorbed into objective functions via Lagrangian multipliers.
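
For concreteness, the following minimal numpy sketch (our illustration, not the authors' code) computes these three distances over flattened 784-pixel images:

```python
import numpy as np

def l1(x, x_prime):
    return np.sum(np.abs(x - x_prime))          # sum of absolute pixel differences

def l2(x, x_prime):
    return np.sqrt(np.sum((x - x_prime) ** 2))  # Euclidean distance

def l_inf(x, x_prime):
    return np.max(np.abs(x - x_prime))          # largest single pixel difference
```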

Wachter et al. [2018] applied their distance function to numerical and categorical college admissions data to generate counterfactuals that humans would find plausibly similar to a reference input. We apply it here to purely numerical inputs. This function permits more difference between counterfactual and original MNIST pixel values when those pixel values exhibit more variation in the original dataset. In the MNIST context, for example, all image borders are black and any border pixels other than black in an AE would be an obvious deviation from a reference image. This function, accordingly, reduces the fitness function more significantly when border pixels vary from the ubiquitous black hue than for deviations from the reference image at other pixel locations where the dataset exhibits greater variation.

We apply the method from Wachter et al. [2018] to our context by measuring pixel value variation in MNIST at each pixel position based on the distribution of absolute differences between the pixel values and the median pixel value at that position. The median of that distribution is used as an indication of the variation: \(\begin{equation*} \mbox{MAD}_j = \mbox{median}{ \left\lbrace \left| x^m_j - \mbox{median}{ \left\lbrace x^i_j , \forall i \in {\mathcal {D}} \right\rbrace } \right| , \forall m \in {\mathcal {D}} \right\rbrace } . \end{equation*}\) A distance function between an AE and its MNIST reference can then be computed as the sum of absolute pixel differences normalized by \(\mbox{MAD} \equiv \lbrace \mbox{MAD}_j , j \in {\mathcal {P}}\rbrace\): (1) \(\begin{equation} \sum _{j \in {\mathcal {P}}}{\frac{| x_j - x^\prime _j|}{\mbox{MAD}_j}} . \end{equation}\)

Note that any \(\mbox{MAD}_j\) can be 0.0, however, if the pixels at that position \(j\) in all MNIST images are identical, as is the case for some border pixels. We avoid numerical computation issues in (1) by imposing a lower bound on MAD, \(\underline{\mbox{MAD}}\), to ensure a nonzero denominator, \(\begin{equation*} \widetilde{\mbox{MAD}}_j = \max {(\underline{\mbox{MAD}} , \mbox{MAD}_j)}, \end{equation*}\) and then rewrite the distance in (1) between an AE \(x^\prime\) and a MNIST image \(x\) as \( \begin{equation*} d_{\texttt {MAD}} \left(x , x^\prime \right) = \sum _{j \in {\mathcal {P}}}{\frac{\left| x_j - x^\prime _j \right|}{\widetilde{\mbox{MAD}}_j}} . \end{equation*} \) We will refer to this distance as MAD_Dist. This function decreases when differences in pixel values between two images decrease, but it is not a norm.
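
The MAD normalization and the resulting MAD_Dist distance can be sketched in numpy as follows (our illustration; x_train holds the flattened training images as above, and MAD_FLOOR plays the role of \(\underline{\mbox{MAD}}\)):

```python
import numpy as np

MAD_FLOOR = 0.1  # lower bound imposed on MAD_j to avoid a zero denominator

def mad_per_pixel(x_train):
    """Median absolute difference of each pixel from its median across the dataset."""
    medians = np.median(x_train, axis=0)                 # median value of each pixel j
    mad = np.median(np.abs(x_train - medians), axis=0)   # MAD_j for each pixel j
    return np.maximum(mad, MAD_FLOOR)                    # adjusted (tilde) MAD_j

def mad_dist(x, x_prime, mad_tilde):
    """Sum of absolute pixel differences normalized by the adjusted MAD."""
    return np.sum(np.abs(x - x_prime) / mad_tilde)
```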

The intent of normalizing by \(\widetilde{\mbox{MAD}}_j\) is to promote visual similarity as can be demonstrated by considering the contribution of two pixel indices, \(j_1\) and \(j_2\), to \(d_{\texttt {MAD}} \left(x , x^\prime \right)\). Assume, for example, that the distributions of pixel values at these two indices, \(j_1\) and \(j_2\), are uniformly distributed on \([0.3, 0.7]\) and \([0.0, 1.0]\), respectively, with both medians equal to 0.5. Then, the lesser variation in the former index is reflected in \(\mbox{MAD}_{j_1} = 0.1\) versus \(\mbox{MAD}_{j_2}=0.25\) for the latter. Normalizing by MAD, in effect, creates a greater multiplier on pixel differences at \(j_1\) than at \(j_2,\) and using \(d_{\texttt {MAD}} \left(x , x^\prime \right)\) as a fitness function will put a greater emphasis on reducing the pixel difference at \(j_1\).

The canonical GA maximizes its fitness function while we want to minimize \(d_{\texttt {MAD}} \left(x , x^\prime \right)\) and the other norms. One way to recast distance as a fitness function is to maximize its reciprocal, as with these fitness functions: \(\begin{eqnarray*} f_1^R \left(x , x^\prime \right) & = & \frac{1}{\ell _1\left(x , x^\prime \right)} \\ f_2^R \left(x , x^\prime \right) & = & \frac{1}{\ell _2\left(x , x^\prime \right)} \\ f_\infty ^R \left(x , x^\prime \right) & = & \frac{1}{\ell _\infty \left(x , x^\prime \right)} \\ f^R_{\texttt {MAD}} \left(x , x^\prime \right) & = & \frac{1}{d_{\texttt {MAD}} \left(x , x^\prime \right)} , \end{eqnarray*}\) where the superscript \(R\) indicates the reciprocal of distance.

Another approach for creating fitness functions where “greater values are better” is to create a linear function by subtracting the distance from its maximum value as done by Mannino and Koushik [2000]: \(\begin{eqnarray*} f_1^L \left(x , x^\prime \right) & = & \max _{y \in \lbrack 0.0 , 1.0 \rbrack ^{784}}{\ell _1 \left(x , y \right)} - \ell _1\left(x , x^\prime \right) \\ f_2^L \left(x , x^\prime \right) & = & \max _{y \in \lbrack 0.0 , 1.0 \rbrack ^{784}}{\ell _2 \left(x , y \right)} - \ell _2\left(x , x^\prime \right) \\ f_\infty ^L \left(x , x^\prime \right) & = & \max _{y \in \lbrack 0.0 , 1.0 \rbrack ^{784}}{\ell _\infty \left(x , y \right)} - \ell _\infty \left(x , x^\prime \right) \\ f^L_{\texttt {MAD}} \left(x , x^\prime \right) & = & \max _{y \in \lbrack 0.0 , 1.0 \rbrack ^{784}}{d_{\texttt {MAD}} \left(x , y \right)} - d_{\texttt {MAD}} \left(x , x^\prime \right) , \end{eqnarray*}\) where the superscript \(L\) indicates a linear function of distance. The maximum distances are easily computed.
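
The maximum distances from a fixed reference image \(x\) over \(\lbrack 0.0, 1.0 \rbrack^{784}\) have simple closed forms, because each pixel contributes most when pushed to whichever endpoint of \([0.0, 1.0]\) is farther from \(x_j\). A sketch (our illustration):

```python
import numpy as np

def max_l1(x):
    return np.sum(np.maximum(x, 1.0 - x))                # each pixel at its far endpoint

def max_l2(x):
    return np.sqrt(np.sum(np.maximum(x, 1.0 - x) ** 2))

def max_linf(x):
    return np.max(np.maximum(x, 1.0 - x))

# Linear fitness, e.g., for the 2-norm: f2_linear = max_l2(x) - l2(x, x_prime)
```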

We evaluate both linear and reciprocal methods because selective pressure and convergence are sensitive to the relative fitness function values among the population, which are different with linear versus reciprocal functions. The reciprocal form causes the fittest phenotypes to have disproportionately better fitness values than with linear functions, thus increasing selective pressure. Whether the increased selective pressure is appropriate is a matter to be determined by experimentation.

3.2.2 Overall Selection.

Selection by proportional fitness, tournament selection, and ranking have been compared in the literature in terms of solution quality, computation complexity, and time for convergence. The computational complexity of computing fitness and the selection operator are dominated in our context by the computation time of classifying the population of AEs with the neural network, which is required for maintaining a feasible population. Accordingly, we prioritize solution quality and convergence.

Ranking selection converges more quickly on traveling salesperson problems than other operators [Razali et al. 2011], but proportional fitness has achieved comparable solution quality to ranking selection for small problems [Bala 2017; Razali et al. 2011]. Tournament selection converges more slowly, although it has less computational complexity [Razali et al. 2011] than proportionate and ranking selection, which have similar complexity [Goldberg and Deb 1991]. The computation time in each generation of our GA is dominated by classifying AEs with a neural network, so the complexity advantage offered by tournament selection is likely to be less significant than the advantage of faster convergence so that fewer generations are required. We, consequently, included proportionate and ranking selection in our study.

In proportionate selection, when generating an AE for MNIST image \(x\), we compute the probability of selecting each phenotype \(k \in {\mathcal {C}}\) from the population as the ratio of the fitness of the \(k\)-th phenotype to the sum of the fitnesses over the entire population, (2) \(\begin{equation} p^P_{*, \dagger , x} \left(k \right) = \frac{f_*^\dagger (x , x^{k \prime })}{\sum _{m \in {\mathcal {C}}}{f_*^\dagger (x , x^{m \prime })}}, \end{equation}\) where \(P\) stands for proportionate selection, \(*\) represents the fitness function type, either \(1, 2, \infty ,\) or \(\mbox{MAD}\), and \(\dagger \in \left\lbrace L,R \right\rbrace\) indicates either the linear or reciprocal form of a fitness function.

Baker [1985] suggested selection based on the rank of a population member’s fitness within the population. A variety of functions can be used to compute probability of selection from rank, including linear and nonlinear ones. We implemented a linear model after Baker and a nonlinear (exponential) model from Blickle and Thiele [1996]. Syswerda [1991] used a similar exponential function, but for deleting population members rather than selecting them for crossover. Assuming that the population members are sorted in ascending order of fitness and assigned an integer rank from the sequence \(k \in \left\lbrace 1 , \ldots , N \right\rbrace\) in that order, then the probability of selecting population member \(k\) for crossover in our linear model, as denoted with the superscript \(L\), is (3) \(\begin{equation} p_k^L = \frac{2k}{N \left(N+1 \right)} . \end{equation}\) This implies an expected number of the best and worst population members in the next generation of \(\frac{2N}{ N + 1 }\) and \(\frac{2}{ N + 1 }\), respectively, with a linearly decreasing expected number of selections between those two endpoints. The expected number of times that the fittest population member is selected for crossover is approximately 2. We implement nonlinear ranking with a parameter \(q\) as in the work of Blickle and Thiele [1996]: (4) \(\begin{equation} p_i^{NL} = \frac{q^{N-i}}{\sum _{i=1}^{N}{q^{N-i}}} , \end{equation}\) where the superscript \(NL\) denotes a nonlinear ranking scheme.
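
A sketch of the three selection schemes (our illustration, not the authors' code); fitness is a length-\(N\) array of fitness values, and the value of \(q\) is an assumed example, as the parameter is not specified here:

```python
import numpy as np

def proportionate_probs(fitness):
    """Equation (2): selection probability proportional to fitness."""
    return fitness / fitness.sum()

def ascending_ranks(fitness):
    """Rank 1 = least fit, rank N = fittest."""
    n = len(fitness)
    ranks = np.empty(n)
    ranks[np.argsort(fitness)] = np.arange(1, n + 1)
    return ranks

def linear_rank_probs(fitness):
    """Equation (3): p_k = 2k / (N(N+1))."""
    n = len(fitness)
    return 2.0 * ascending_ranks(fitness) / (n * (n + 1))

def nonlinear_rank_probs(fitness, q=0.99):
    """Equation (4): exponential ranking with parameter q."""
    n = len(fitness)
    weights = q ** (n - ascending_ranks(fitness))
    return weights / weights.sum()

# Example: draw two parent indices for one crossover.
# parents = np.random.choice(len(fitness), size=2, p=proportionate_probs(fitness))
```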

3.2.3 Elitist Selection.

Elitist selection is a tactic where one or more of the best-fit candidate solutions are retained from one generation to the next. Elitist selection tends to increase the convergence rate by making the best phenotypes persistently available for selection. We use elitist selection in a novel way, however, such that it may also contribute to increased diversity. Specifically, any AE is classified in one of 9 classes (10 minus 1), and we retain a single best-fit AE from each of the classes represented in the population, which will exclude the class of the reference MNIST image. The best-fit solution in the population is among those retained, but the best-fit solutions from other classes may not have good fitness levels and therefore contribute to diversity and the exploration of possible multiple optimal solutions, albeit at a slowed convergence rate similar to the effects of “inverse elitist selection” as described in the work of Corus et al. [2021].

GAs applied to constrained optimization problems can be designed to either permit infeasible candidate solutions in the population or not [Vidnerová and Neruda 2020]. In the former tactic, the fitness function might apply a penalty to infeasible phenotypes. In the latter tactic, as we have implemented, candidate solutions that are made infeasible by crossover and/or mutation are remedied by either resolving the infeasibility or replacing the candidate solution. Feasible candidate solutions in our case are images that the reference neural network classifies with a different label than the ground truth label of the reference MNIST image for which it is being derived. When crossover and mutation cause a candidate solution to be infeasible because it has the same label as the reference MNIST image, we replace those offspring with one of their parents with equal probability. Besides having assurance that the parent is feasible, the parent might be likely to have “good” genetic content if their offspring had the same label as the reference MNIST image. And so, this tactic might also act as a form of elitist selection, thereby increasing selective pressure [Bäck and Hoffmeister 1991]. But, the primary motive for this tactic is to maintain a population of feasible candidate solutions rather than relying on chance that some AEs will be feasible or necessitating development of a scheme to resolve infeasible solutions while somehow not degrading their fitness.

Two classes of GAs have been defined where, in generational algorithms, all population members are replaced in each generation and, in steady state algorithms, only a few members are replaced in each generation [Syswerda 1991]. Our algorithm leans toward the former class of algorithms because all population members are replaced each generation, except those retained through elitist selection.

3.2.4 Reproduction and Crossover.

Effectively balancing exploration of the solution space and exploitation of current candidate solutions is of critical importance in the design of GAs. Excessive exploitation, for example, increases selective pressure and may cause premature convergence without appropriate exploration of alternate solutions. One design feature of our algorithm that favors exploration is that most offspring are created using crossover. Our approach was to keep the algorithm as simple as possible, so we used a single-point crossover with the crossover point uniformly distributed across the phenotype. This crossover mechanism motivated the flattened form of the MNIST image array to be used in the GA.

We denote the single-point crossover operator for AEs \(v\) and \(w\) as \( \begin{equation*} c \left(x^{v \prime } , x^{w \prime } \right) = \lbrack x^{v \prime }\lbrack 0 , d \rbrack , x^{w \prime }\lbrack d , 784 \rbrack \rbrack , \end{equation*} \) where \(d\) is an integer randomly selected from \([0, 784]\) with uniform probability on each integer, \(\lbrack \cdot , \cdot \rbrack\) denotes the concatenation of its arguments, and \(x \lbrack b , e \rbrack\) indicates elements \(b\) through \(e-1\) as in a Python slice.
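
A sketch of this single-point crossover with a uniformly distributed crossover point (our illustration):

```python
import numpy as np

def single_point_crossover(parent_v, parent_w):
    """Concatenate parent_v[0:d] with parent_w[d:784] for a random cut point d."""
    d = np.random.randint(0, 785)   # d uniform on {0, ..., 784}
    return np.concatenate([parent_v[:d], parent_w[d:]])
```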

3.2.5 Mutation Operators and Population Initialization.

We evaluated two mutation operators, RAND and MAD_Dist. RAND is naïve because it uses no contextual knowledge, whereas MAD_Dist is based on the MNIST dataset. RAND mutates an allele by simply replacing it with a uniform random variable on \(\lbrack 0.0, 1.0 \rbrack\).

The intent of MAD_Dist mutation is to minimize the number of pixels that are obviously altered by sampling from the distribution of pixel values for the pixel index being mutated from among those MNIST images with the same ground truth label. Its effect is that if all MNIST images have black borders, for example, then a border pixel mutated with MAD_Dist will be black. This method might render minimally conspicuous perturbations to reference MNIST images if the pixel value distributions vary among the different digits, as we might suspect they do. The effectiveness of this approach is still, nonetheless, subject to variation in handwriting among images of a particular classification.

Stated mathematically, if \({\mathcal {V}}_j\) is a set of the unique values in the MNIST training set at pixel position \(j\) and \({\mathcal {D}}_l \equiv \lbrace i \in {\mathcal {D}} | L (x^i) = l \rbrace\) is the set of indices for MNIST images having a label of \(l\), then the probability of pixel \(j\) being mutated to value \(v\) when generating an AE for MNIST image \(x^i\) is (5) \(\begin{equation} p_{i,j} \left(v \right) = \frac{\sum _{m \in {\mathcal {D}}_{L \left(x^i \right)}}{{\mathcal {I}}{\left(x^m_j , v \right)}}}{| {\mathcal {D}}_{L \left(x^i \right)}|} , v \in {\mathcal {V}}_j , \end{equation}\) where \({\mathcal {I}}{\left(w , z \right)} = 1\) if \(w=z\) and 0 otherwise.

Exploration and diversity were promoted by mutating each offspring phenotype with a probability of 0.7. Within a phenotype selected for mutation, the expected number of pixels mutated was 2. (These parameters were selected based on experimentation.) Specifically, a uniform random variable \({\bf urv}\) on \(\lbrack 0.0 , 1.0 \rbrack\) was selected for each allele, which was mutated independently if its respective random variable was \(\frac{2}{784}\) or less. These crossover and mutation rates, which might be considered to be high, supported diversity and balanced other design features that favored exploitation, including creating a minority of offspring by elitist selection as discussed previously.
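
A sketch of the two mutation operators and the per-offspring and per-allele mutation rates (our illustration; value_probs_by_label is an assumed precomputed structure holding, for each ground truth label and pixel index, the empirical value distribution of Equation (5)):

```python
import numpy as np

P_MUTATE_PHENOTYPE = 0.7      # probability an offspring is mutated at all
P_MUTATE_ALLELE = 2.0 / 784   # per-pixel mutation probability (expected 2 pixels)

def rand_mutation():
    """RAND: replace the allele with a uniform random value on [0.0, 1.0]."""
    return np.random.uniform(0.0, 1.0)

def mad_dist_mutation(label, j, value_probs_by_label):
    """MAD_Dist: sample a value for pixel j from the empirical distribution over
    training images whose ground truth label matches the reference image (Eq. (5))."""
    values, probs = value_probs_by_label[label][j]
    return np.random.choice(values, p=probs)

def mutate(offspring, label, value_probs_by_label, use_mad_dist=True):
    if np.random.uniform() < P_MUTATE_PHENOTYPE:
        for j in range(784):
            if np.random.uniform() <= P_MUTATE_ALLELE:
                offspring[j] = (mad_dist_mutation(label, j, value_probs_by_label)
                                if use_mad_dist else rand_mutation())
    return offspring
```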

Phenotypes were initialized pixel by pixel in each scenario using the mutation operator employed in that scenario. Phenotypes thus initialized were re-initialized if the label predicted by the reference neural network concurred with the ground truth label of the MNIST image for which an AE was being generated.

3.2.6 Reference Neural Networks.

The GA requires a neural network to classify members of the population to ensure they are feasible—that is, that their labels do not coincide with the ground truth label of the reference MNIST image. Upon initialization of the population, phenotypes are replaced if the reference neural network infers the same label as the target MNIST image, as described previously. Similarly, phenotypes are replaced if crossover and mutation yield offspring with the same classification as the MNIST image. AEs are specific to a particular reference neural network because one neural network might misclassify an input, whereas another might not. Thus, the characteristics of AEs will vary depending on the neural network employed. AEs are thus generated relative to a particular MNIST image and reference neural network.

We used two neural network architectures within the GAs. A Feedforward (FF) neural network and a Convolutional Neural Network (CNN) were motivated by a review of well-performing networks [Ciresan et al. 2011; Digit Recognizer 2020; LeCun et al. 2020]. The FF neural network had four layers with Batch Normalization (BN). Using common parlance in the neural network literature, its designation is 784-800-BN-400-BN-10. The CNN had eight layers: 1-32-BN-32-BN-P-64-BN-64-BN-P-256-BN-100-BN-10, where P denotes MaxPooling. We tried multiple training protocols with these networks, including using just the MNIST training data, using adversarial training where the training data was augmented with AEs, and elastically deforming the training data. Ultimately, we used adversarial training with the FF network and elastically deformed images with the CNN. We verified the accuracy of these networks by training each network 100 times, as reported in Table 2.

Table 2. Average and Standard Deviation of Training Accuracy for the Neural Networks

Network | Average | Standard Deviation
CNN | 0.995 | 0.00050
FF | 0.986 | 0.00069

In an FF neural network, information flows linearly from the input of the network, through intermediate computations used to define a function, and finally to the output. There are no feedback connections whereby outputs of the model are fed back into the model. A CNN classifies an image by assigning importance to various sub-patterns of pixels within the input image. The architecture of a CNN is analogous to that of the connectivity pattern of neurons in the human brain and was inspired by the organization of the visual cortex. CNNs are often used to classify images, so one might conjecture that they might also yield better AEs when used with a GA.

3.2.7 Remaining GA Parameters.

A population size of 1,000 provided sufficient diversity, whereas larger populations increased computation time without an apparent improvement in the quality of the AEs. Smaller populations resulted in degraded image quality. Similarly, 2,000 generations were sufficient to generate visually plausible adversarial images with many parameter sets, and a larger number of generations did not improve image quality.

3.2.8 Algorithm Summary.

The following bullet points summarize the GA parameters, and a formal statement of our algorithm is shown in Algorithm 2; a condensed sketch of the main loop follows the list. Although Algorithm 2 indicates multiply-nested loops, these operations were coded in a more efficient manner using the Python numpy package. The computational complexity of the GA is \({\mathcal {O}} \left(G N \right),\) where \(G\) is the number of generations and \(N\) is the population size:

  • Population size: 1,000 AEs

  • Number of generations: \(G = 2,\!000\) generations

  • Phenotype representation: One-dimensional array \(x^{k \prime }\) of 784 pixels, indexed by \(k \in {\mathcal {C}} \equiv \left\lbrace 1 , \ldots , 1000 \right\rbrace\)

  • Allele representation: Floating-point values in phenotype \(k\), \(x^{k \prime }_j\) on \(\left[ 0.0, 1.0 \right]\) for \(j \in {\mathcal {P}}\)

  • Fitness functions: \(f_1^L (\cdot), f_1^R (\cdot), f_2^L (\cdot), f_2^R (\cdot), f_\infty ^L (\cdot), f_\infty ^R (\cdot), f_{\texttt {MAD}}^L (\cdot)\), and \(f_{\texttt {MAD}}^R \left(\cdot \right)\)

  • Selection operators: Proportionate fitness, linear ranking, nonlinear ranking

  • Crossover probability: 1.0

  • Crossover operator: Uniformly distributed single-point crossover

  • Mutation probability (phenotype): Offspring were mutated with a probability of 0.7

  • Mutation probability (allele): Each allele was mutated, independently, with probability of \(\frac{2}{784}\) in phenotypes selected for mutation

  • Mutation operators: MAD_Dist and RAND

  • Phenotype feasibility: Enforced at initialization and after crossover and mutation

  • Population initialization: Each pixel was initialized using the mutation operator within a parameter set

  • Elitist selection: The best-fit candidate solution for each possible MNIST classification (0–9) is held over into the next generation

  • Minimum MAD: \(\underline{\mbox{MAD}} = 0.1.\)
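
Under the parameters above, the following highly condensed sketch of one generation of the loop is our paraphrase, not the authors' code; select, crossover, mutate, and fitness stand for the operators chosen in a given parameter set, classify is the reference neural network, and best_per_class is a hypothetical helper implementing the elitist selection described earlier.

```python
import numpy as np

def best_per_class(population, fitness, classify):
    """Elitism: retain the fittest member for each class label present in the population."""
    best = {}
    for member in population:
        label = classify(member)
        if label not in best or fitness(member) > fitness(best[label]):
            best[label] = member
    return list(best.values())

def next_generation(population, true_label, fitness, select, crossover, mutate,
                    classify, n_pop=1000):
    offspring = best_per_class(population, fitness, classify)
    while len(offspring) < n_pop:
        parent_v, parent_w = select(population, fitness)   # proportionate or ranking selection
        child = mutate(crossover(parent_v, parent_w))
        # Feasibility: if the child now carries the reference label, replace it with a parent.
        if classify(child) == true_label:
            child = parent_v if np.random.uniform() < 0.5 else parent_w
        offspring.append(child)
    return offspring
```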

3.2.9 Experimental Trials.

Experiments can be run for all combinations of the fitness, selection, and mutation methods we have outlined, but some are redundant. For example, the rank of population members is the same regardless of whether the linear or reciprocal form of a fitness function is used, so the ranking selection methods need be run only for either the reciprocal or linear form of a fitness function. Table 3 lists the 16 combinations of selection operators and fitness functions that avoid this redundancy; for the ranking operators, each listed fitness function covers both its linear and reciprocal forms, which are equivalent. We evaluated each of these scenarios in combination with two mutation operators, RAND and MAD_Dist, and two neural networks, FF and CNN. In total, then, we generated AEs with GAs for 64 parameter sets.

Table 3. Selection Operator and Fitness Function Combinations

Proportionate: \(f_1^L\), \(f_1^R\), \(f_2^L\), \(f_2^R\), \(f_\infty ^L\), \(f_\infty ^R\), \(f_{\texttt {MAD}}^L\), \(f_{\texttt {MAD}}^R\) (8 combinations)
Ranking-Linear: \(f_1\), \(f_2\), \(f_\infty\), \(f_{\texttt {MAD}}\) (linear and reciprocal forms equivalent; 4 combinations)
Ranking-Nonlinear: \(f_1\), \(f_2\), \(f_\infty\), \(f_{\texttt {MAD}}\) (linear and reciprocal forms equivalent; 4 combinations)

Running each of the 64 scenarios for all 60,000 MNIST images represents considerable computation time, and in any case, generating AEs for a smaller number of randomly selected MNIST images suffices for our survey of human subjects. Accordingly, we generated AEs for 500 randomly selected images, for each of the 64 scenarios.

3.2.10 Implementation.

The GA was implemented in Python using a parallel processing architecture: a “manager” program instantiated multiple “worker” processes, each of which independently generated an AE based on a reference image from the MNIST dataset using a pool of CPU cores on the high-performance computing cluster at William & Mary ( https://www.wm.edu/it/rc). The code used TensorFlow version 1.14. Each computing node contained two Intel Xeon E5-2640 v4 Broadwell 2.4-GHz chips with 20 cores. Although it is typical to use Graphical Processing Units (GPUs) with neural networks, a GPU cannot execute multiple threads simultaneously, and we found that parallel execution with multiple CPUs was faster than generating one AE at a time on a GPU. The code for the GA can be found online at https://github.com/jrb28/telo-2022-16 and is archived at https://zenodo.org/record/7562791.


4 BENCHMARK GENERATIVE METHODS

We compared AEs generated with GAs against existing state-of-the-art methods for generating AEs. Gradient-based methods are most common, which include a constrained version of the Broyden–Fletcher–Goldfarb–Shanno algorithm [Szegedy et al. 2014], the FGSM [Goodfellow et al. 2014], the PGD method [Madry et al. 2017], the DeepFool method [Moosavi-Dezfooli et al. 2016], and various approaches proposed by Carlini and Wagner [2017]. We compare a GA with the FGSM and PGD methods, which perhaps are the most widely cited of these methods.

A thumbnail sketch of the FGSM is shown in Algorithm 3, which incorporates an initial random step as did Madry et al. [2017] and Tramèr et al. [2017]. Noise is implemented using \({\bf u}\) to denote a 784-element vector with every element being a uniformly distributed random variable on \([ -1.0, 1.0]\). After the initial random perturbation, a step of size \(\alpha\) is taken in a direction that is intended to reduce the probability of the image being correctly classified using the sign of the gradient of the loss function \(\nabla _x L \left(\theta , x, y \right)\), where \(\theta\) are the neural network weights, \(x\) is the input, and \(y\) is the output. Note that the step is not exactly in the direction of the gradient contrary to a literal interpretation of FGSM because the signs of the gradient elements are used rather than their magnitudes.

Note that neither the FGSM nor the PGD method includes a check on whether an AE is classified differently than its MNIST reference image and therefore cannot guarantee a feasible AE. Goodfellow et al. [2014, p. 4] note that the scale of the gradient-sign step must be relatively large, \(\alpha =0.25\), to cause a 99% classification error rate with a logistic regression model, which results in AEs that would arguably be perceived as having been manipulated relative to the original MNIST images. With steps of smaller magnitudes, such as \(\alpha =0.01\), we found that only 0.29% of AEs were classified differently than their reference MNIST images although the images would likely be perceived as having little or no distortion relative to MNIST images. Further, we found that 18.1% of AEs were feasible with \(\alpha =0.10\). Our focus on creating adversarial images that were classified differently than their reference images while also being a small distance from the reference images motivated us to construct a variant of the FGSM that accomplished both goals by taking the smallest gradient-sign step possible to effect a change in classification. Rather than taking a fixed step size as in the original FGSM, the algorithm dynamically adjusts the step size to each particular reference MNIST image. As described in Algorithm 4, the algorithm starts with a small step of \(\alpha = 0.01\) and, if the classification of the AE has not changed, iteratively increases the step size by the same amount (0.01) until the AE \(x^\prime\) is classified differently than the reference MNIST image or until the maximum step size is reached. The percentage of feasible AEs generated by this algorithm was 98.2%, with a mean, median, and standard deviation of step size equal to 0.1696, 0.1200, and 0.1664, respectively.
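
A sketch of this step-size search (our paraphrase of the idea behind Algorithm 4, not the authors' code); grad_sign is assumed to be the sign of the loss gradient with respect to the input, computed by the reference network, the initial random perturbation is omitted for brevity, and the maximum step size of 1.0 is an assumption.

```python
import numpy as np

def fgsm_min_step(x, true_label, grad_sign, classify,
                  alpha0=0.01, step=0.01, alpha_max=1.0):
    """Increase the gradient-sign step size until the classification changes
    or the maximum step size is reached."""
    alpha = alpha0
    while alpha <= alpha_max:
        x_adv = np.clip(x + alpha * grad_sign, 0.0, 1.0)   # keep pixels in [0, 1]
        if classify(x_adv) != true_label:
            return x_adv, alpha        # feasible AE found with the smallest tested step
        alpha += step
    return None, None                  # no feasible AE within the step-size limit
```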

The PGD algorithm, as shown in Algorithm 5, can be viewed as an extension of the FGSM because it takes a predetermined number of gradient steps of the same type as in the FGSM while reassessing the gradient prior to each step. The PGD method “clips” pixel values to keep them within the feasible \([0.0,1.0]\) range, as does the FGSM. In addition, the PGD method also constrains each pixel value to be within \(\epsilon\) of the reference image pixels with the intent of preserving visual similarity with the reference image. In so doing, Madry et al. [2017] are using the \(\ell _\infty\) norm as a constraint. The authors use a step size of \(\alpha = 0.1\) and a maximum permitted absolute difference of \(\epsilon = 0.3\), as do we. Notation \({\bf 1}\) in Algorithm 5 is a vector of 784 elements with value 1. The minimum and maximum functions are to be interpreted as element-wise operations. The PGD method, like the FGSM, does not literally use a projected gradient because it uses only the sign of the gradient but not its magnitude and, rather than projecting the perturbed image onto the feasible space, the pixels are clipped. We used code provided by the authors for the PGD algorithm [Madry et al. 2020].
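
A sketch of the PGD iteration as described (our illustration following this construction, not the authors' released code); grad_sign(x_adv) is assumed to return the sign of the loss gradient at the current iterate, and the number of steps is an assumed example value.

```python
import numpy as np

def pgd(x, grad_sign, n_steps=40, alpha=0.1, epsilon=0.3):
    """Iterated gradient-sign steps, clipping pixels to [0, 1] and to within
    epsilon of the reference image x (the L-infinity constraint)."""
    x_adv = np.clip(x + np.random.uniform(-epsilon, epsilon, size=x.shape), 0.0, 1.0)
    for _ in range(n_steps):
        x_adv = x_adv + alpha * grad_sign(x_adv)
        x_adv = np.minimum(np.maximum(x_adv, x - epsilon), x + epsilon)  # epsilon clip
        x_adv = np.clip(x_adv, 0.0, 1.0)                                 # pixel range clip
    return x_adv
```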


5 EXPERIMENT FOR ASSESSING HUMAN PERCEPTION OF VISUAL SIMILARITY

5.1 Experiment Description

We constructed an Internet survey to gather data on how similar humans perceive AEs to be to the reference MNIST images upon which they are based. The survey prompted participants to compare pairs of images in a browser, one of which was a MNIST image and the other an AE derived from that MNIST image from one of the 66 methods described earlier (64 GA parameter sets and two gradient methods). (Figure 2(a) presents an example.) Participants were given a binary choice of whether the two images were “quite similar” or “quite different.” Each participant responded to 2,000 such comparisons, which, according to data collected in the prototyping phase of design, was expected to take between 1.5 and 2.0 hours. Each comparison was rendered after randomly selecting one of the 500 MNIST image indexes with uniform probability and one of the 66 generative methods with uniform probability. The expected number of observations for all combinations of GA parameter sets and MNIST images are equal, although some variation will be observed due to random sampling.

Fig. 2. Survey query dialogues.

5.2 Survey Design Rationale

5.2.1 Survey Question Scale.

Likert scales are often used to measure participant perception. We considered Likert scales with varying numbers of responses but settled on a binary, 2-point Likert scale because it best fit the task at hand. Although a greater number of responses from which to choose might, on its face, permit greater precision, we chose a binary response primarily because it had the potential of making responding easier and faster, thus reducing survey fatigue. These data also permit clear statistical analysis.

5.2.2 Data Integrity.

We anticipated that participants might have trouble keeping focused on the (repetitive) task, so we incorporated a mechanism to mitigate that tendency. This device was implemented by showing two red squares at random intervals rather than two MNIST-based images, as shown in Figure 2(b), to which the participants were instructed to respond by clicking on the button labeled “these images are red” rather than either of the responses that are relevant for comparing two images of digits. This alternate dialogue was shown, in expectation, 5% of the time, and it implicitly communicated to the participants that the researchers would be paying attention to whether their responses made sense and also provided a signal upon which some participants’ responses could be omitted from the analysis if a sufficient frequency of incorrect responses to this check suggested that an individual’s responses were unreliable. Training included familiarization with this check as well as familiarization and practice opportunities for the screens with comparative images. Our goal with training was to not only familiarize participants with the task but also to permit them to develop their own sense of what “quite similar” and “quite different” meant. In that vein, training also provided a warm-up period from which data was not collected. The training required approximately 5 minutes to complete.

We used these responses as an indication of whether a participant was being sufficiently attentive to the survey, and in particular, we chose a rate of 10% incorrect responses as the threshold for including participants’ responses in our analysis. Our analysis omits the responses of 11 of the 109 participants whose responses to red squares were incorrect at a rate of 10% or greater.

5.2.3 Survey Pool.

We obtained approval from our institution’s Protection of Human Subjects Committee (PHSC), and we collected responses from 109 undergraduate and graduate students. Participants in the former group were taking a major or minor in business administration, whereas the latter group were from a Master of Science in Business Analytics program. The identities of participants in both groups were not known to the researchers. Both groups received extra credit in a course in return for their participation.


6 ANALYSIS: VISUAL SIMILARITY OF AE GENERATIVE METHODS FOR MNIST

This section contains results for the relative effectiveness of selection operators, mutation operators, fitness functions, and neural networks in generating AEs that are visually similar to original MNIST images upon which they are based. Descriptive and inferential results not referenced in this section are contained in Appendix A.1. We refer to specific combinations of parameters using a shorthand to indicate a neural network, a fitness function, a selection operator, and a mutation operator, in that order, using the acronyms from Table 4. The specification FF-L2-RL-R, for example, refers to a GA using an FF neural network, the 2-norm as a fitness function, a linear ranking scheme for selection, and random mutation.

Table 4. GA Parameter Acronyms

In the sections that follow, we discuss, in this order, a summary of the results, a comparison of GA and gradient methods, and an analysis of the best-performing GA parameters.

6.1 Overall Summary of Results

Table 5 shows the proportion of responses where participants indicated that the AEs were “quite similar” to the reference MNIST image upon which they were based, as well as the number of observations (\(n\)) for each parameter set, which ranged from 2,472 to 2,710. The mean \(n\) for a parameter set was approximately 2,582. The range of the proportion of “similar” responses among GA parameter sets is significant, from a maximum of approximately 98.9% to a minimum of 2.4%. GAs using the best-performing parameter combinations can therefore clearly generate AEs that are perceived as being similar to MNIST images: 11 parameter sets resulted in proportions of 0.98 or greater, and 30 parameter sets were associated with proportions of 0.95 and higher. Conversely, some GA parameter combinations performed poorly, particularly those whose indices were 41 and greater, as is indicated by the precipitous decline in proportion of “similar” responses in Figure 3 with the CNN-L2-P-R parameter set at index 41. Note, as mentioned previously, that reciprocal and linear fitness functions are equivalent under ranking-based selection operators, so that the L1, L2, and M fitness functions could be replaced with L1L, L2L, and ML, respectively, in Table 5 when the ranking selection methods RL and RNL are used.

Fig. 3. Proportion of “similar” responses of the parameter sets from Table 5.

Index | GA/Method | \(n\) | Proportion || Index | GA/Method | \(n\) | Proportion
1 | FF-L2-RNL-R | 2639 | 0.989 || 34 | FF-L1L-P-M | 2498 | 0.819
2 | CNN-L1-RNL-R | 2646 | 0.983 || 35 | CNN-L2L-P-M | 2478 | 0.811
3 | FF-L1-RNL-R | 2627 | 0.983 || 36 | CNN-ML-P-M | 2628 | 0.809
4 | CNN-L1-RL-R | 2677 | 0.983 || 37 | FF-ML-P-M | 2525 | 0.801
5 | CNN-ML-RNL-R | 2541 | 0.982 || 38 | FF-L2L-P-M | 2631 | 0.790
6 | CNN-L2-RL-R | 2483 | 0.981 || 39 | CNN-Linf-RNL-R | 2667 | 0.745
7 | FF-L2-RL-M | 2558 | 0.981 || 40 | FF-Linf-RNL-R | 2472 | 0.744
8 | FF-ML-RNL-R | 2538 | 0.980 || 41 | CNN-L2-P-R | 2614 | 0.262
9 | CNN-L2-RNL-R | 2525 | 0.980 || 42 | FF-L2-P-R | 2611 | 0.200
10 | CNN-ML-RL-R | 2641 | 0.980 || 43 | FF-Linf-RL-R | 2639 | 0.135
11 | FF-ML-RL-R | 2637 | 0.980 || 44 | CNN-Linf-RL-R | 2614 | 0.134
12 | FF-L1-RL-R | 2555 | 0.979 || 45 | CNN-L1-P-R | 2614 | 0.124
13 | CNN-L2-RNL-M | 2548 | 0.976 || 46 | CNN-M-P-R | 2616 | 0.104
14 | FF-L2-RNL-M | 2710 | 0.975 || 47 | FF-L1-P-R | 2516 | 0.098
15 | CNN-L2-P-M | 2542 | 0.975 || 48 | FF-M-P-R | 2644 | 0.093
16 | CNN-L2-RL-M | 2529 | 0.975 || 49 | CNN-L1L-P-R | 2656 | 0.073
17 | FF-L2-RL-R | 2613 | 0.974 || 50 | CNN-ML-P-R | 2495 | 0.065
18 | FF-L2-P-M | 2552 | 0.973 || 51 | FF-L1L-P-R | 2498 | 0.061
19 | FF-L1-RL-M | 2591 | 0.971 || 52 | FF-ML-P-R | 2641 | 0.059
20 | FF-L1-P-M | 2571 | 0.970 || 53 | FF-L2L-P-R | 2647 | 0.047
21 | CNN-ML-RNL-M | 2619 | 0.970 || 54 | CNN-L2L-P-R | 2492 | 0.046
22 | CNN-L1-RL-M | 2524 | 0.969 || 55 | CNN-LinfL-P-M | 2614 | 0.042
23 | FF-L1-RNL-M | 2583 | 0.969 || 56 | FF-Linf-P-M | 2508 | 0.041
24 | FF-M-P-M | 2647 | 0.969 || 57 | CNN-Linf-RNL-M | 2554 | 0.039
25 | CNN-L1-P-M | 2615 | 0.969 || 58 | CNN-Linf-P-M | 2555 | 0.039
26 | FF-ML-RL-M | 2593 | 0.968 || 59 | FF-LinfL-P-M | 2558 | 0.039
27 | CNN-ML-RL-M | 2558 | 0.968 || 60 | CNN-Linf-RL-M | 2632 | 0.038
28 | CNN-L1-RNL-M | 2666 | 0.964 || 61 | FF-Linf-RNL-M | 2610 | 0.037
29 | FF-ML-RNL-M | 2539 | 0.964 || 62 | FF-Linf-RL-M | 2584 | 0.037
30 | CNN-M-P-M | 2568 | 0.957 || 63 | FF-LinfL-P-R | 2602 | 0.035
31 | FGSM | 2566 | 0.898 || 64 | CNN-LinfL-P-R | 2516 | 0.034
32 | PGD | 2685 | 0.860 || 65 | CNN-Linf-P-R | 2526 | 0.033
33 | CNN-L1L-P-M | 2485 | 0.829 || 66 | FF-Linf-P-R | 2584 | 0.024

Table 5. Proportion of Responses Indicating AEs “Quite Similar” to Basis MNIST Images

6.2 Statistical Significance of Proportions

We computed inferential statistics to determine whether the differences in proportions of responses indicating “quite similar” among the AE generation methods were meaningful—that is, whether the differences were statistically significant. We report in this section on hypothesis tests we conducted between all combinations of GA parameter sets and gradient methods. We tested for differences in proportions while correcting for the multiple hypothesis tests performed with the same dataset using the Holm-Bonferroni Correction (HBC) method. The HBC reduces the possibility of a spurious type I error, albeit with a loss of statistical power. We also computed uncorrected hypothesis tests using a test of proportions and chi square tests, which are described in Appendix A.1.
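As an illustration of this procedure (not our exact analysis code), the pairwise tests and the HBC can be sketched with statsmodels; the arrays of “quite similar” counts and total responses per parameter set are assumed inputs, and a two-sided test is shown for brevity.

    from itertools import combinations
    from statsmodels.stats.proportion import proportions_ztest
    from statsmodels.stats.multitest import multipletests

    def pairwise_holm(similar_counts, total_counts, alpha=0.05):
        """Pairwise two-sample proportion tests with a Holm-Bonferroni correction.
        similar_counts[i] and total_counts[i] are the "quite similar" count and the
        total number of responses for parameter set i (assumed inputs)."""
        pairs = list(combinations(range(len(total_counts)), 2))
        pvals = []
        for i, j in pairs:
            _, p = proportions_ztest([similar_counts[i], similar_counts[j]],
                                     [total_counts[i], total_counts[j]])
            pvals.append(p)
        reject, p_adj, _, _ = multipletests(pvals, alpha=alpha, method="holm")
        return {pair: (rej, p) for pair, rej, p in zip(pairs, reject, p_adj)}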

Figure 4 shows the statistical significance of the differences in proportions of “quite similar” responses for each pair of parameter sets. The axis labels of Figure 4 are the indices of the parameter sets as shown in Table 5, and the color of each intersecting row and column indicates whether the difference in proportions for two parameter sets is statistically significant: a black cell indicates a statistically significant difference at the \(\alpha =0.05\) level, whereas white indicates no statistically significant difference. One can determine all parameter sets with no statistically significant difference from a particular parameter set by observing within its row (or column) which cells are shaded white. The parameter sets corresponding to the white cells in that row have statistically equivalent proportions. The square regions shaded white along the diagonal of Figure 4, whose diagonals coincide with the diagonal of the figure at large and which have been highlighted with colored borders, indicate subsets of the parameter sets with indiscernible differences in their proportions. We call each of these an equivalence class. These often intersecting equivalence classes can also be displayed as a Venn diagram, as shown in Figure 5 for the best three equivalence classes.

Fig. 4. Map of statistically significant differences in proportions of “quite similar” responses among GA parameter sets and gradient methods using tests of proportions with the HBC method. Parameter set indices are shown on the \(x\) -axis and \(y\) -axis, which refer to the indices in the first and fourth columns of Table 5. Cells shaded black indicate statistically significant differences, whereas cells shaded white indicate no significant difference.

Fig. 5. Venn diagram of the first three equivalence classes.

Although the HBC reduces type I error, it simultaneously enlarges some equivalence classes relative to the uncorrected analysis. The FGSM and PGD gradient methods (indices 31 and 32) are the sole members in their equivalence classes in all statistical tests, corrected or otherwise, due to substantial differences between their proportions and the proportions of the GA parameter combinations above and below them in Table 5. Their performance provides a distinct boundary between the best-performing GA parameter sets and the worst.

6.3 GAs versus Gradient Methods

The AEs from GAs for 30 parameter sets were judged by participants to be more visually similar to MNIST images than were the FGSM and PGD gradient methods. Moreover, there is a clear separation between the best GAs and the gradient methods: the proportions of “quite similar” responses for the FGSM and the PGD method were more than 5% less than the proportion for the 30th best GA in Table 5. Figure 6 gives a visual comparison of AEs from competing methods for 10 randomly selected MNIST images. In each row, a reference MNIST image is rendered, then four AEs from best to worst as measured by the proportions of “similar” responses. From left to right in each row, AEs are shown for the best-performing GA parameter set, the two gradient methods, and, finally, the worst-performing GA parameter set.

Fig. 6. Comparison of MNIST images with AEs generated with various methods.

The frequency histograms of the absolute differences in pixel values between AEs and their reference MNIST images shown in Figure 7 quantitatively display the differences between AEs and the reference images rendered in Figure 6. The best GA parameter set yields absolute pixel differences that have the lowest mean and dispersion. The distribution of absolute pixel value differences for the worst GA with the fitness function based on \(\ell _\infty\) is nearly uniform, with much greater dispersion. The distributions for the gradient methods still have considerably more variance than the best GA, which results in many gray pixels that may be perceived as noise: the FGSM and the PGD method create numerous gray pixels in the black backgrounds of the images and also in the field of the digit where white pixels are dominant. The GA, in contrast, changes fewer pixel values and to lesser degrees so that the deviations are less prominent.
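Such histograms can be produced directly from the image arrays; the following is a minimal sketch assuming the AEs and their references are float arrays on \([0.0, 1.0]\).

    import numpy as np
    import matplotlib.pyplot as plt

    def abs_diff_histogram(reference_images, adversarial_images, bins=50):
        """Histogram of per-pixel absolute differences between AEs and their reference images."""
        ref = np.asarray(reference_images, dtype=np.float32)
        adv = np.asarray(adversarial_images, dtype=np.float32)
        diffs = np.abs(adv - ref).ravel()
        plt.hist(diffs, bins=bins, range=(0.0, 1.0))
        plt.xlabel("absolute pixel difference")
        plt.ylabel("frequency")
        return diffs.mean(), diffs.std()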

Fig. 7. Frequency histogram of absolute differences of pixel values relative to MNIST images.

6.4 Best-Performing GA Parameters

We report in this section on findings based on descriptive and inferential analysis of the survey data regarding which neural networks, fitness functions, selection operators, and mutation operators yield AEs that are perceived to be most similar to MNIST images. Details of the statistics are included in Appendix A.1. The specific findings that we discuss in the following sections, which are based on analysis of all parameter combinations, are as follows:

(1) An FF neural network is preferred because its AEs receive comparable proportions of “quite similar” responses and computation time is less than with a CNN.

(2) L2 is the best-performing fitness function.

(3) RNL is the best-performing selection operator.

(4) Random mutation is preferred over MAD mutation.

Indeed, the parameter combination scoring the greatest proportion of “quite similar” responses, FF-L2-RNL-R, employs these operators.

6.4.1 Reference Neural Network.

CNNs are often used to classify images because their accuracy is often better than that of other neural network architectures such as simpler FF networks, so we might conjecture that a CNN would facilitate more visually similar AEs than FF networks. We find, however, no significant evidence that CNNs perform better than FF networks in that regard. Moreover, the GA computation time is significantly less for the smaller FF network (as reported in Section A.1.7), so using an FF network is the dominant choice for a reference neural network.

6.4.2 Fitness Function.

We tested the six pairwise combinations of the three best-performing fitness functions, which were L2, L1, and ML. None of the differences were statistically significant at the \(\alpha = 0.05\) level. Although the performance of these three fitness functions is statistically equivalent, we note that the L2 fitness function performed best with a proportion of “quite similar” that was perhaps a nontrivial 0.023 greater than the next best alternative, which is L1. Thus, one might prefer L2 based on descriptive statistics.

6.4.3 Selection Operator.

The two best selection operators were RNL and RL. The difference in these two operators’ proportion of “quite similar” responses was not statistically significant, although a regression analysis indicated that the proportion of “quite similar” responses for RNL was 0.077 greater than for RL. The RNL operator is therefore preferred.

6.4.4 Mutation Operator.

A combination of statistical tests and descriptive statistics indicate that random mutation is preferred over MAD_Dist mutation. Notably, 11 of the 12 best parameter sets in Table 5 employ random mutation.


7 GA FOR THE CIFAR-10 DATASET

7.1 CIFAR-10 Dataset

We investigate generating AEs with a GA for the CIFAR-10 dataset [Krizhevsky et al. 2022] because it offers the opportunity to investigate whether the GA developed for MNIST grayscale images can be generalized to CIFAR-10 color images. We will evaluate whether the fitness functions, selection operators, and mutation operators can be generalized to the CIFAR-10 dataset and whether the same parameters that are best for MNIST are also best for CIFAR-10. Whereas MNIST images are grayscale and represented by two-dimensional arrays, CIFAR-10 color images are represented by three-dimensional arrays with three layers along the third dimension for the intensities of RGB (red, green, and blue) pixels. Each image, specifically, is a 32 \(\times\) 32 \(\times\) 3 array with values on \([0,255]\). Similar to our treatment of the MNIST images, we converted the integer values to floating-point values to accommodate greater accuracy with the reference neural network, so each value in the array is on \([0.0 , 1.0]\). Like the MNIST dataset, CIFAR-10 is reasonably well studied and prior research has produced neural networks that could be used within a GA.
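A minimal sketch of this preprocessing follows, assuming the copy of CIFAR-10 bundled with Keras (which may differ from the distribution actually used in our experiments).

    import numpy as np
    from tensorflow.keras.datasets import cifar10

    # Each CIFAR-10 image is a 32 x 32 x 3 uint8 array with values on [0, 255].
    (x_train, y_train), (x_test, y_test) = cifar10.load_data()

    # Convert integer intensities to floating point on [0.0, 1.0], as with MNIST.
    x_train = x_train.astype(np.float32) / 255.0
    x_test = x_test.astype(np.float32) / 255.0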

The CIFAR-10 training set is composed of 50,000 images, each of which has an associated label (classification) from 1 of 10 possible categories, whose labels are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Figure 8 displays sample images from each category.

Fig. 8. CIFAR-10 examples for each of its 10 classifications.

7.2 A GA for Generating AEs

Our goal is to construct a GA for CIFAR-10 images by revising the MNIST algorithm to the least extent possible to assess its generalizability for generating AEs across datasets. First, the fitness functions and selection operators need no modification and can be applied directly, albeit on inputs with different dimensions. Mutation operators can be generalized in multiple ways for arrays with a greater number of dimensions, and we describe two such methods that are analogous with the mutation method used with MNIST images. The initialization of the population uses the most promising of these two mutation methods. The description of the GA in subsequent sections starts with mutation because its adaptation from MNIST to the CIFAR-10 dataset requires the most discussion and because the method for initializing the population depends on it.

7.2.1 Mutation Operators.

We tried two methods for generalizing random mutation from the MNIST dataset to CIFAR-10. We first investigated converting RGB-encoded pixels to an alternate color model, which is called the HSV (hue, saturation, and value) model. Grayscale images encoded in RGB for black, white, and medium gray, for example, are \((0.0,0.0,0.0)\), \((1.0,1.0,1.0),\) and \((0.5,0.5,0.5)\), respectively. In the HSV model, where hue describes the color, saturation describes the intensity of the color (a saturation of zero yields a shade of gray), and the value represents the brightness, the grayscale colors black, white, and medium gray are encoded as \((0.0,0.0,0.0)\), \((0.0,0.0,1.0)\), and \((0.0,0.0,0.5)\), respectively. Brightness is the only parameter that varies with HSV encodings of grayscale hues, so the mutation method used with MNIST could have been implemented equivalently using HSV by mutating the brightness value on \(\lbrack 0.0 , 1.0 \rbrack\). Using the HSV model to hold constant hue and saturation while mutating brightness is enticing for the CIFAR-10 dataset because it might permit a mutated pixel to be more consistent with the reference image color palette and, locally, with its neighboring pixels than if all three values in the RGB model were mutated, as is described next.

The second approach we investigated for mutating pixels, which might also be considered analogous with the method used with MNIST, is to choose three uniform random variables on \(\lbrack 0.0 , 1.0 \rbrack\), one for each of the RGB intensities. We found that this method, visually, was inferior to mutating brightness in the HSV model, because mutated pixels often seemed obviously out of place due to their hues being distinctly different from the surrounding pixels. Using brightness mutation with the HSV model does provide a better match with the color palette of the reference image, so we employed that method.

We did depart from the RAND mutation method used with MNIST images in that the brightness was not randomly selected but, instead, the pixel brightness of the reference CIFAR-10 image was randomly perturbed. This mutation variant could have been used with MNIST data, and we view it as a possible algorithmic improvement there rather than an indication that the same GA is not generally applicable. We found in sample runs that pixels thus mutated were less conspicuous than pixels whose brightness was randomly selected from the entire range \(\lbrack 0.0, 1.0 \rbrack\). We call this PRand mutation because pixels are mutated by Perturbing the reference image pixel by a Random differential, which can be either positive or negative. The details of PRand are described in the following after a description of some necessary preliminary notation.

We denote a CIFAR-10 image with the same notation used for MNIST with modifications for the different image array shape including that, now, \({\mathcal {P}} = \left\lbrace 1, \ldots , 3072 \right\rbrace\). Additionally, similarly with MNIST, \(x^i\) denotes the CIFAR-10 image with index \(i\), although now \(x^i \in \left(h , s , v \right)^{1024}\), where \(h,s,v \in \lbrack 0.0 , 1.0 \rbrack\) to accommodate the hue, saturation, and brightness values in the HSV model. Furthermore, denote the hue, saturation, and brightness of pixel \(j\) in CIFAR-10 image \(i\) by \(h^i_j\), \(s^i_j\), and \(v^i_j\), respectively. We denote an AE for a CIFAR-10 image by \(x^{\prime }\), as we did with MNIST, with the hue, saturation, and brightness values of its \(j\)-th pixel being \(h^{\prime }_j\), \(s^{\prime }_j\), and \(v^{\prime }_j\), respectively. If pixel \(j\) of a CIFAR-10 image \(i\) is being mutated, then the mutated brightness element is (6) \(\begin{equation} v^{\prime }_j = \min { \left(\max {\left(v^i_j + \epsilon (0.5 - {\bf urv}), 0.0 \right)} , 1.0 \right)} , \end{equation}\) where \(\epsilon\) is the scale of the mutation and \({\bf urv}\) is, as before, a uniform random variable on \(\lbrack 0.0 , 1.0 \rbrack\). Furthermore, hue and saturation remain unchanged in the mutation: \(h^{\prime }_j = h^i_j, s^{\prime }_j = s^i_j\). Note that pixel brightness values are clipped to maintain feasible values on the interval \([0.0, 1.0]\).
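A minimal sketch of PRand mutation per Equation (6) follows; it is not the implementation used in our experiments and it assumes images have already been converted to HSV arrays on \([0.0, 1.0]\) (the rgb_to_hsv/hsv_to_rgb helpers from Matplotlib are one possible conversion).

    import numpy as np
    from matplotlib.colors import rgb_to_hsv, hsv_to_rgb

    rng = np.random.default_rng()

    def prand_mutate(hsv_image, pixel_indices, epsilon):
        """Perturb the brightness (V channel) of the selected pixels per Equation (6);
        hue and saturation are left unchanged, and values are clipped to [0, 1]."""
        mutated = hsv_image.copy()                   # (32, 32, 3) HSV array on [0, 1]
        v = mutated[..., 2].ravel()                  # flattened brightness channel (copy)
        urv = rng.random(len(pixel_indices))         # uniform random variables on [0, 1)
        v[pixel_indices] = np.clip(v[pixel_indices] + epsilon * (0.5 - urv), 0.0, 1.0)
        mutated[..., 2] = v.reshape(hsv_image.shape[:2])
        return mutated

    # Example: convert an RGB image on [0, 1] to HSV, mutate 5 random pixels, convert back.
    # hsv = rgb_to_hsv(rgb_image)
    # ae_rgb = hsv_to_rgb(prand_mutate(hsv, rng.choice(1024, size=5, replace=False), epsilon=0.8))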

The \(\texttt {MAD_Dist}\) mutation method suffered the same downside as did random mutation based on the RGB model where mutated pixels were often obvious because the dominant hues vary significantly among CIFAR-10 images. Although the hues in one image may be skewed toward browns and reds, another may be skewed toward yellows and greens such that a pixel selected from the former image for the mutation of a pixel in the latter image would stand out. We consequently chose not to explore the viability of \(\texttt {MAD_Dist}\) mutation for CIFAR-10 images. The \(\texttt {MAD_Dist}\) mutation method is one operator that was reasonably successful with the MNIST dataset that does not translate well to CIFAR-10.

7.3 Population Initialization

The population was initialized using a variation of the method used for MNIST images with the intent of minimizing the difference in brightness between the reference image and the AE. Rather than selecting a random brightness value on \([0.0, 1.0]\) for every pixel, a subset of pixels were mutated using PRand mutation. The goal of this approach was to realize an initial population with better fitness that, subsequently, should facilitate better resulting AEs with fewer generations being required. This approach could be applied to MNIST images as well, so it represents an algorithmic improvement rather than being an indication that the MNIST GA is not generalizable. The population initialization routine described in the following is summarized in Algorithm 6.

The basis of the initialization routine is an observation that AEs can sometimes be generated by mutating one pixel from the reference image, whereas in other instances, many pixels must be mutated. Based on this, the routine generates candidate AEs using multiple modes, where the number of pixels mutated varies among the modes. Specifically, the number of pixels mutated takes a value from the set \({\mathcal {M}} \equiv [1, 2, 3, 4, 5, 6, 10, 50, 100]\). An (initial) scale parameter is associated with each mode \(m \in {\mathcal {M}}\), namely \({\mathcal {E}} \equiv [1.6, 1.3, 1.1, 0.9, 0.8, 0.7, 0.6, 0.45, 0.3]\). Denote by \(\epsilon {\left(m \right)}\) the scale value corresponding with some mode \(m\)—for example, \(\epsilon {\left(3 \right)} = 1.1\). These scale factors were determined through experimentation with an aim to balance the goals of mutating the fewest pixels possible while effecting a classification change with a sufficient probability to enable generation of AEs at a reasonable rate. The magnitude by which pixels must be mutated varies among the CIFAR-10 images, however, so these scale factors are dynamically adjusted as described later to negotiate the tradeoff of minimizing mutation magnitude and generating a population in a reasonable amount of time. Note that the initial set of scale factors is decreasing in the number of pixels mutated, which reflects that less significant mutations are likely needed when a greater number of pixels are mutated.

The initialization routine iterates through the modes \(m \in {\mathcal {M}}\) and, for each \(m\), a batch of \(N_B = 50,\!000\) candidate AEs is created. Each candidate AE (\(x^\prime\)) is generated by first copying the reference CIFAR-10 image (\(x^i\)) and then selecting \(m\) pixel indices without replacement whose pixel brightnesses are mutated. Denote by \({\mathcal {J}}_m\) a set of indices selected for a candidate AE \(x^{\prime }\). The brightness in each pixel \(j \in {\mathcal {J}}_m\) is mutated as follows according to (6): \( \begin{equation*} v^{\prime }_j = \min { \left(\max {\left(v^i_j + \epsilon {\left(m \right)} (0.5 - {\bf urv} ), 0.0 \right)} , 1.0 \right)} \qquad \forall j \in {\mathcal {J}}_m. \end{equation*} \) Each candidate AE thus generated is retained in the initial population if the reference neural network classifies it differently than the reference image (i.e., \(L (x^{\prime }) \ne L (x^i)\)); otherwise, it is discarded.

Multiple iterations through \({\mathcal {M}}\) were sometimes necessary to generate \(N = 2,\!000\) feasible AEs. Pixel mutation scale parameters were dynamically adjusted to minimize deviation from the reference image while generating AEs at a reasonable rate. Scale parameters were adjusted either individually by mode or all at once, depending on the circumstance. If for any mode \(m \in {\mathcal {M}}\) the number of feasible AEs generated was above an upper threshold, then that scale factor \(\epsilon \left(m \right)\) was reduced by multiplying it by \(a_{dec} = 0.8\) to reduce the difference between the reference image and the candidate AEs, although they would likely be generated at a slower rate. If the number of feasible AEs was below a lower threshold, then the scale factor was increased by multiplying it by \(a_{inc} = 1.1\) to generate AEs more quickly. If the cumulative number of feasible AEs generated in one iteration through \({\mathcal {M}}\) was greater than three times the target population size, then all AEs were discarded and the routine was restarted with smaller scale factors to reduce the magnitude of mutation.

After iterating four times through the modes \(m \in {\mathcal {M}}\) with fewer than 100 feasible AEs being found, the set of modes \({\mathcal {M}}\) was expanded by \({\mathcal {M}}^+ \equiv [200, 500, 1000]\), namely \({\mathcal {M}} = {\mathcal {M}} \cup {\mathcal {M}}^+\) to permit a greater number of pixels to be mutated. Accordingly, the set \({\mathcal {E}}\) was augmented with \({\mathcal {E}}^+\), \({\mathcal {E}} = {\mathcal {E}} \cup {\mathcal {E}}^+\), with \({\mathcal {E}}^+ \equiv [0.3 , 0.3 , 0.2]\).

The population initialization terminates when the number of feasible AEs meets or exceeds the target number, \(N\). This approach takes longer than initializing an AE by mutating every pixel because many more candidate AEs are infeasible, but it creates much better AEs with each mutated pixel having a hue most often closer to the corresponding pixels in the reference image. This promotes increased visual similarity and improved fitness. The quality of the initial population created permitted the number of generations to be significantly less than for the MNIST AEs, with \(G=125\).
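The following sketch summarizes the structure of the initialization loop; the thresholds lo and hi are illustrative, the restart and mode-expansion logic is omitted (Algorithm 6 gives the precise version), and the classify argument is an assumed callable that returns the reference network's predicted label. It reuses the prand_mutate function and rng from the earlier sketch.

    modes  = [1, 2, 3, 4, 5, 6, 10, 50, 100]                   # pixels mutated per candidate
    scales = [1.6, 1.3, 1.1, 0.9, 0.8, 0.7, 0.6, 0.45, 0.3]    # initial epsilon(m) per mode

    def initialize_population(hsv_ref, ref_label, classify, N=2000, batch_size=50000,
                              lo=10, hi=500, a_inc=1.1, a_dec=0.8):
        """Simplified multi-mode initialization; lo and hi are illustrative thresholds."""
        eps = list(scales)
        population = []
        while len(population) < N:
            for k, m in enumerate(modes):
                feasible = 0
                for _ in range(batch_size):
                    pixels = rng.choice(1024, size=m, replace=False)
                    candidate = prand_mutate(hsv_ref, pixels, eps[k])
                    if classify(candidate) != ref_label:   # feasible: classification changed
                        population.append(candidate)
                        feasible += 1
                if feasible > hi:                           # mutate less aggressively
                    eps[k] *= a_dec
                elif feasible < lo:                         # generate feasible AEs faster
                    eps[k] *= a_inc
                if len(population) >= N:
                    break
        return population[:N]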

7.3.1 Fitness, Selection, Reproduction, Crossover, and Benchmarks.

The multiple forms of the L1, L2, Linf, and \(\texttt {MAD_Dist}\) fitness functions are directly applicable to CIFAR-10 images. We evaluated all of these in our study with the exception of \(\texttt {MAD_Dist}\) because it is associated with \(\texttt {MAD}\) mutation, which we did not evaluate because of its poor visual similarity.

The same selection operators were evaluated for CIFAR-10 as were evaluated for MNIST. A single-point crossover operator was used on flattened phenotypes, as we used with MNIST. Consistent with the GA implementation for MNIST, each new generation consists of entirely new offspring, with the exceptions of those retained for elitist selection and those parents retained because their offspring are infeasible.

7.4 Reference Neural Network

The neural network we used was selected from the many that were considered based on their accuracy and inference latency, which is the computation time for image classification. Researchers would, of course, prefer highly accurate neural networks, but this second criterion was especially important: neural network inference is the primary determinant of computation time in the GA because the feasibility of every offspring is tested in every generation with neural network classification, as are all candidate AEs for the initial population. One benchmark site for CIFAR-10 classification [MetaAI 2022] indicates that higher accuracy can be achieved with transformer-based architectures, where the current best network is reported to have an accuracy of 0.99 [Dosovitskiy et al. 2021]. This accuracy comes with a cost in terms of prediction latency, however. Experiments performed with transformer-based architectures indicate that inference times can be 100 to 200 times slower than with the CNN architecture we ultimately used in this research. Increasing computation time by 100 times would make the GA impractical. Similar experiments were conducted with EfficientNets and ResNets, whose accuracy was higher than that of the CNN we used, but with similar performance disadvantages.

We ultimately selected a CNN with nine layers: 1-C32-BN-C32-BN-P-C64-BN-C64-BN-P-C128-BN-C128-P-F-100-BN-10, where P is a max-pooling layer, C is a convolutional layer, BN is batch normalization, and F is a flattening layer. The ReLU activation function was used for all but the output layer, which used softmax. The RMSProp optimizer was used with an annealing schedule. This network had a validation accuracy of 0.832. To increase its accuracy, we trained the network with an elastically distorted training set, which increased its accuracy to 0.873.
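A Keras sketch consistent with this layer specification is shown below. Kernel sizes, pooling windows, and padding are not specified above, so the 3 \(\times\) 3 kernels and default 2 \(\times\) 2 pools are assumptions, and the annealing schedule for RMSProp is omitted.

    from tensorflow.keras import layers, models, optimizers

    def build_reference_cnn(num_classes=10):
        model = models.Sequential([
            layers.Input(shape=(32, 32, 3)),
            layers.Conv2D(32, 3, padding="same", activation="relu"),
            layers.BatchNormalization(),
            layers.Conv2D(32, 3, padding="same", activation="relu"),
            layers.BatchNormalization(),
            layers.MaxPooling2D(),
            layers.Conv2D(64, 3, padding="same", activation="relu"),
            layers.BatchNormalization(),
            layers.Conv2D(64, 3, padding="same", activation="relu"),
            layers.BatchNormalization(),
            layers.MaxPooling2D(),
            layers.Conv2D(128, 3, padding="same", activation="relu"),
            layers.BatchNormalization(),
            layers.Conv2D(128, 3, padding="same", activation="relu"),
            layers.MaxPooling2D(),
            layers.Flatten(),
            layers.Dense(100, activation="relu"),
            layers.BatchNormalization(),
            layers.Dense(num_classes, activation="softmax"),
        ])
        model.compile(optimizer=optimizers.RMSprop(),   # annealing schedule omitted
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model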

7.5 Algorithm Summary

All of the parameters used with MNIST were used in the GA for the CIFAR-10 images except for those noted next:

  • Population size: \(N = 2,\!000\)

  • Phenotype representation: One-dimensional array \(x^k\) of 3,072 pixels, indexed by \(k \in {\mathcal {C}} \equiv \lbrack 1 , 2000 \rbrack\)

  • Mutation operator: PRand, which perturbs the brightness of reference-image pixels per Equation (6) using uniform random variables on \(\left[ 0.0, 1.0 \right]\)

  • Number of generations: \(G = 125\)

The GA for CIFAR-10 follows Algorithm 2 for MNIST with the exceptions noted previously. Table 6 indicates the fitness and selection operator combinations investigated with the CIFAR-10 dataset. The code for generating CIFAR-10 AEs can be found at https://github.com/jrb28/telo-2022-16 and is archived at https://zenodo.org/record/7562791.

Selection Operator | \(f_1^L\) | \(f_1^R\) | \(f_2^L\) | \(f_2^R\) | \(f_{inf}^L\) | \(f_{inf}^R\)
Proportionate | X | X | X | X | X | X
Ranking-Linear |  | X |  | X |  | X
Ranking-Nonlinear |  | X |  | X |  | X

Table 6. Selection Operator and Fitness Function Combinations

7.6 Benchmark Methods

We used the FGSM and the PGD method for CIFAR-10 as benchmarks. We used the PGD code from Madry et al. [2022]. The FGSM is a special case of the PGD method with a single step rather than multiple steps, so we used the same PGD code to implement the FGSM by setting the number of steps to 1.
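The relationship between the two methods can be seen in the minimal \(\ell_\infty\) PGD sketch below (written in TensorFlow for illustration, not the Madry et al. code): setting the number of steps to 1 with a step size equal to the budget reduces the loop to the FGSM.

    import tensorflow as tf

    def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, num_steps=10):
        """L-infinity PGD without a random start; num_steps=1 with alpha=eps is the FGSM."""
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
        x_adv = tf.identity(x)
        for _ in range(num_steps):
            with tf.GradientTape() as tape:
                tape.watch(x_adv)
                loss = loss_fn(y, model(x_adv))
            grad = tape.gradient(loss, x_adv)
            x_adv = x_adv + alpha * tf.sign(grad)
            x_adv = tf.clip_by_value(x_adv, x - eps, x + eps)   # project onto the eps-ball
            x_adv = tf.clip_by_value(x_adv, 0.0, 1.0)           # keep pixel values feasible
        return x_adv

    # fgsm_examples = pgd_attack(model, x, y, eps=8 / 255, alpha=8 / 255, num_steps=1)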

Index | GA/Method | \(n\) | Proportion || Index | GA/Method | \(n\) | Proportion
1 | L2-RL | 8599 | 0.688 || 8 | L1-RNL | 8450 | 0.558
2 | L2-P | 8310 | 0.656 || 9 | L1L-P | 8460 | 0.525
3 | L2-RNL | 8254 | 0.596 || 10 | Linf-P | 8428 | 0.522
4 | L1-RL | 8553 | 0.582 || 11 | LinfL-P | 8346 | 0.517
5 | Linf-RL | 8364 | 0.573 || 12 | Linf-RNL | 8432 | 0.511
6 | L2L-P | 8409 | 0.572 || 13 | PGD | 8379 | 0.259
7 | L1-P | 8343 | 0.570 || 14 | FGSM | 8416 | 0.174

Table 7. Proportion of Responses Indicating AEs “Quite Similar” to Basis CIFAR-10 Images


8 ANALYSIS: VISUAL SIMILARITY OF AE GENERATIVE METHODS FOR CIFAR-10

As with MNIST, we collected data from respondents on the visual similarity of AEs generated with the various parameter sets and omitted responses from any respondent who misidentified the red squares in 10% or more of the instances. This section reports findings from the analysis of that data, specifically regarding the relative effectiveness of selection operators and fitness functions in generating AEs that are visually similar to the reference CIFAR-10 images upon which they are based.

Table 7 shows the proportion of responses where participants indicated that the AEs were “quite similar” to the reference CIFAR-10 image upon which they were based, as well as the number of observations (\(n\)) for each parameter set. The average \(n\) for the GA parameters and benchmark methods in Table 7 is 8,410. The highest proportions of AEs considered to be “quite similar” among the different GA combinations are lower than for MNIST images. The data does not shed light on why this might be the case, but perhaps differences between color images are easier to detect visually than differences between grayscale images. Alternatively, comparing color images might create a higher cognitive load such that differences might be perceived where no significant difference exists.

8.1 Overview of Results and Comparison with Benchmarks

Figure 9 shows the statistical significance of the pairwise differences between the various GA parameter sets and the benchmark methods using a test of proportions with the HBC method, which takes into account the multiple hypothesis tests that are based on the same dataset. It also shows the equivalence classes as defined in the MNIST analysis. (An analogous figure using uncorrected tests is shown in Appendix A.2.) The two parameter sets with the greatest proportion of “quite similar” responses (L2-RL and L2-P) are significantly better than the remaining alternatives in a statistical sense and statistically different from each other as well. The two methods with the lowest proportion of “quite similar” responses are the two benchmark gradient methods, PGD and FGSM. Their proportions of “quite similar” responses are statistically different from any GA parameter set and also from each other.

Fig. 9. Map of statistical significance of the test of proportions among GA parameter sets and gradient methods corrected with the Holm-Bonferroni method. Axes’ tick mark labels refer to the indices in the first and fourth columns of Table 7.

Figure 10 compares 10 randomly selected CIFAR-10 images with the AEs generated by the two best GA parameter sets and the two gradient methods. Figure 11 shows frequency histograms for the absolute pixel differences between CIFAR-10 and those same methods for the same 10 random CIFAR-10 images. The frequency histograms clearly show that the difference in pixel values from CIFAR-10 images tends to be greater for the PGD and FGSM gradient methods as is evidenced by the mean and standard deviation of absolute pixel differences.

Fig. 10. Comparison of CIFAR-10 images with AEs generated with various methods.

Fig. 11. Frequency histogram of absolute differences of pixel values relative to CIFAR-10 images with superimposed mean and standard deviation of absolute differences ( \(\mu\) and \(\sigma\) , respectively).

The execution time of the gradient methods was again, and as expected, shorter than for the GAs. The PGD method and FGSM mean execution times were 0.131 and 0.018 seconds per AE, respectively, whereas the mean computation time for the GAs is most often on the order of 30 minutes per AE. Details regarding computation time are contained in Appendix A.2.

8.2 Best-Performing GA Parameters

An analysis across all GA parameter sets indicates that the best-performing fitness functions were L2 and L2L, which are equivalent when paired with the best selection operator, which is RL. Indeed the best-performing individual parameter set employed these parameters, which is L2-RL or, equivalently, L2L-RL. The detailed analysis of these operators is contained in Appendix A.2.


9 DISCUSSION AND CONCLUSION

Our analysis establishes that GAs can produce AEs that humans find more visually similar to MNIST and CIFAR-10 images than those generated by gradient methods. Nearly half of the GA parameter sets used with MNIST generated AEs that were perceived to be more visually similar to MNIST images than gradient methods, and all of the CIFAR-10 GA parameter sets generated AEs that were perceived to be more visually similar to the original images than were those generated with gradient methods.

The same general GA structure was used successfully with both MNIST and CIFAR-10, with some parameter settings performing well across both datasets. The L2 fitness function performed best across datasets, and it was associated with the highest proportion of “quite similar” responses for both datasets. In addition, RL selection performed well with both datasets with no statistical difference between it and other selection operators that performed well with MNIST. Based on descriptive statistics, however, one might prefer RNL selection for MNIST data and RL for CIFAR-10 data. Random mutation also performed well for both datasets, although the random mutation mechanism RAND with MNIST was improved with the introduction of the PRand mutation operator for the CIFAR-10 dataset.

Some GA parameters, conversely, did not perform consistently across datasets, including MAD_Dist mutation, which performed well with MNIST but did not translate well to CIFAR-10 color images, perhaps due to the diversity of color palettes in that dataset. Additionally, it was necessary to use different neural networks for MNIST and CIFAR-10 because the neural network architectures and training weights must be tailored to the dataset. Therefore, we conclude that the basic structure of the GA is sufficiently general to handle both grayscale and color images with the incorporation of a suitable neural network and perhaps minor calibration of the selection operator.

Having identified the L2 fitness function as being correlated with greater human perception of similarity to the reference images, future research might safely use it when generating AEs for grayscale or color images without needing to undertake cumbersome surveys. Similarly, the best selection and mutation operators found in this study could be redeployed in similar contexts.

Besides generating AEs with superior visual similarity, another advantage of our GA over gradient methods is that it guarantees a valid AE with a classification different from the reference image, whereas gradient methods do not. Guaranteeing a feasible AE by maintaining a population of entirely feasible members might be an effective technique in other problem contexts.

The disadvantage of a GA relative to gradient methods is its slower speed. In summary, the comparative advantages of GAs are greater visual similarity and guaranteed AE feasibility, whereas the comparative advantage of gradient methods is their speed. Performance tradeoffs among different approaches are to be expected, and the tradeoff here is in speed versus visual similarity and feasibility. The relative importance of these performance dimensions depends on the purpose of the AEs. Speed is essential in adversarial training so that the training data can be augmented with a sufficient volume of data, whereas, in that context, AEs that are invalid and more distant from the reference input may be advantageous to training rather than detrimental. To the contrary, measuring neural network robustness accurately requires, first, that an AE be valid and, second, that it be as close as possible to the minimum perturbation required for a different classification. Counterfactuals to explain neural network output, to name another context, must also be as close as possible to the reference input while having a classification different from that of the reference input. Thus, although greater speed is always desirable, validity and precision cannot be sacrificed in these circumstances. Moreover, speed may not be as important in these two contexts if the exceptions to be explained are only a small percentage of the inputs (e.g., explaining why a loan application was rejected). In any case, taking 30 minutes to generate a counterfactual may be sufficiently fast.

In addition to mathematical differences between authentic inputs and AEs, visual differences are of paramount importance in some contexts. When neural networks gather input without human intervention, generate AEs for training, guide decisions of little consequence, and the affected parties are never concerned with an explanation, then human perception of the input and AE is moot. When neural networks guide decisions of consequence or require explanation by regulation or law, then a counterfactual must be interpretable by a human vis-à-vis the actual input. This is the case for decisions such as hiring and promotion, home loans, parole decisions, and sentencing; in our setting, the analogous capability is for humans to perceive little or no difference between reference MNIST and CIFAR-10 images and their AEs. We have validated with empirical data from human subjects that a GA is capable of this task and is better than gradient methods. This is, to our knowledge, the first study of its kind. AEs generated by the GA contained herein were shown in that study to be perceived as having greater visual similarity with authentic inputs.

The advantage of an AE generated with a GA for the purpose of providing explanation and measuring robustness is evident from Figures 6 and 7 for MNIST and Figures 10 and 11 for CIFAR-10. A counterfactual for explaining why a neural network suggested one decision when another decision was desired is keener when the divergence in outcome can be explained with fewer changes in the input. This is exactly what is possible with the best GA versus the noisier AEs from gradient methods. Similarly, a precise estimate of neural network robustness requires an AE with the minimum difference from the authentic input, which, again, is what the best GA parameter sets provide.

In summary, we find that gradient methods are appropriate for adversarial training while GAs are preferred for measuring robustness, creating counterfactuals, providing intuitive explanations for neural network output, or in any other context where humans interact with the input or output of a neural network.

AEs generated with GAs might permit advances in multiple streams of research. First, the efficacy of AEs from GAs applied to adversarial training of neural networks could be assessed. These AEs might permit neural network weights to establish clearer and more accurate boundaries between classifications. Next, as suggested earlier, work might commence to confirm that AEs from GAs do, in fact, provide a better estimation on the magnitude of perturbation that causes a neural network’s classification to change. Given the superior fitness of AEs from GAs relative to other methods investigated herein, these might reveal a neural network’s vulnerability to smaller perturbations than would other methods. The transferability of an AE from the network upon which it was generated to another neural network has been studied, and the transferability of AEs from GAs might be assessed relative to AEs otherwise generated. If AEs developed on one neural network are more likely to be misclassified by another network, hacking an unknown system is more viable, which counters the possible benevolent effect of better adversarial training. Last, other measures might be evaluated for their correspondence with visual similarity including inception score, Fréchet inception distance, and the Wasserstein distance.

APPENDIX

A STATISTICAL ANALYSIS


A.1 MNIST Dataset

This section contains details of the statistical analysis for the MNIST dataset.

A.1.1 Inter-Rater Reliability.

We used Kendall’s coefficient of concordance (Kendall’s W) to evaluate the consistency of respondents’ ranking of GA parameter sets and gradient methods as determined by the proportion of “quite similar” responses. The null hypothesis of this nonparametric test is that the rankings of respondents have no correlation. The W statistic is computed and tested with a chi square sampling distribution. That test yielded \(p=0.00,\) so we reject the null hypothesis in favor of the alternate hypothesis that the respondents’ rankings are correlated, which indicates consistency in their visual perception of the relative similarity of AEs generated by the various parameter sets.
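Kendall’s W and its chi square test can be computed directly from the raters-by-methods rank matrix, as in the following sketch (no correction for ties is applied, which may differ slightly from the statistic we report).

    import numpy as np
    from scipy.stats import chi2

    def kendalls_w(ranks):
        """Kendall's W for an (m raters) x (n items) matrix of ranks, without a tie
        correction, plus the chi square approximation of its significance."""
        m, n = ranks.shape
        rank_sums = ranks.sum(axis=0)
        s = ((rank_sums - rank_sums.mean()) ** 2).sum()
        w = 12.0 * s / (m ** 2 * (n ** 3 - n))
        chi_sq = m * (n - 1) * w          # approximately chi square with n - 1 df
        return w, chi2.sf(chi_sq, df=n - 1)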

A.1.2 Statistical Significance of Differences in Proportions.

Before we can declare that any parameter set or individual parameter choice is better than another, we must ensure that the differences in proportions of AEs found to be “visually similar” in Table 5 are statistically meaningful. We do so in this section with two tests: a test for the difference in two population proportions and a chi square test. Although the chi square test is nonparametric, the test of proportions requires both \(np \gt 5\) and \(n \left(1- p \right) \gt 5\) for the proportions for the binomial data to be roughly normally distributed, where \(p\) is the proportion of “quite similar” responses for a parameter set and \(n\) is the number of observations for that parameter set. Each of the two tests was applied to all pairwise combinations of the GAs and gradient methods in Table 5, which constituted 4,356 hypothesis tests. We used a significance level of \(\alpha = 0.05\) for all hypothesis tests. The proportions tested in the first test are computed by dividing the number of “quite similar” responses by the total number of responses for each parameter set or gradient method. The null hypothesis of the (one-tail) test of proportion differences is that the greater of the proportions is less than or equal to the lesser proportion for the other parameter set. When the null hypothesis is rejected, it implies that we should accept the alternate hypothesis that the first, greater proportion is indeed statistically greater than the second. When comparing the first two parameter sets in Table 5, for example, the null and alternative hypotheses are as follows:

\(H_0\): The proportion of responses indicating visual similarity for the FF-L2-RNL-R parameter set (0.989) is less than or equal to the proportion for the CNN-L1-RNL-R parameter set (0.983). (p-value > \(\alpha\))

Fig. A.1. Map of statistically significant differences in proportions of “quite similar” responses among GA parameter sets and gradient methods using tests of proportions. Parameter set indices are shown on the \(x\) -axis and \(y\) -axis, which refer to the indices in the first and fourth columns of Table 5. Cells shaded black indicate statistically significant differences, whereas cells shaded white indicate no significant difference.

\(H_1\): The proportion of responses indicating visual similarity for the FF-L2-RNL-R parameter set is greater than the proportion for the CNN-L1-RNL-R parameter set. (p-value \(\le \alpha\))

The chi square test indicates, less specifically, whether the distribution of responses (“quite similar” versus “quite different”) is the same or different for two parameter sets. Although it does not test whether one proportion is greater than another, we used this test as validation for the test of proportions, and we expected that both tests would be in general agreement regarding which pairs of parameter sets yielded statistically significant differences. The chi square test hypotheses statements are, as an example, for the first two parameter sets:

\(H_0\): The distribution of responses between “quite similar” and “quite different” for the FF-L2-RNL-R parameter set is the same as the distribution for the CNN-L1-RNL-R parameter set. (p-value > \(\alpha\))

\(H_1\): The distribution of responses between “quite similar” and “quite different” for the FF-L2-RNL-R parameter set is different from the distribution of responses for the CNN-L1-RNL-R parameter set. (p-value \(\le \alpha\))
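Such a test can be computed from a standard 2 \(\times\) 2 contingency table; in the sketch below, the counts are reconstructed approximately from the \(n\) and proportions in Table 5 for these two parameter sets.

    from scipy.stats import chi2_contingency

    # 2 x 2 table of response counts: rows are the two parameter sets, columns are
    # ["quite similar", "quite different"]; counts are approximate reconstructions
    # from the n and proportions reported in Table 5.
    table = [[2610, 29],    # FF-L2-RNL-R
             [2601, 45]]    # CNN-L1-RNL-R
    chi_sq, p_value, dof, expected = chi2_contingency(table)
    print(p_value <= 0.05)  # True indicates rejecting H0 at alpha = 0.05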

Uncorrected versions of Figure 4 using a test of proportions and chi square tests are shown in Figures A.1 and A.2, which show the statistical significance of the differences in proportions of “quite similar” responses for each pair of parameter sets. The axes of these figures are the same as in Figure 4, where the axis labels are the indices of the parameter sets as shown in Table 5. Again, the color of each intersecting row and column indicates whether the difference in proportions for two parameter sets is statistically significant: a black cell indicates a statistically significant difference at the \(\alpha =0.05\) level, whereas a white cell indicates no statistically significant difference. One can determine all parameter sets with no statistically significant difference from a particular parameter set by observing within the row (or column) for that parameter set which cells are shaded white. The parameter sets corresponding to the white cells in that row have statistically equivalent proportions. The square regions shaded white along the diagonal of Figures A.1 and A.2, whose diagonals coincide with the diagonal of the figure at large and which have been highlighted with colored borders, indicate subsets of the parameter sets with indiscernible differences in their proportions. We call each of these an equivalence class. As with Figure 4, FGSM and PGD (indices 31 and 32) are alone in their equivalence classes due to substantial differences between their proportions and the proportions of the GA parameter combinations above and below them in Table 5. The chi square test largely coincides with the results from the test of proportions, as shown in Figure A.2.

Fig. A.2. Map of statistically significant differences in proportions of “quite similar” responses among GA parameter sets and gradient methods using a chi square test. Parameter set indices are shown on the \(x\) -axis and \(y\) -axis, which refer to the indices in the first and fourth columns of Table 5. Cells shaded black indicate statistically significant differences, whereas cells shaded white indicate no significant difference.

A.1.3 Reference Neural Network.

CNNs are often used to classify images because their accuracy is often better than that of other neural network architectures such as simpler FF networks. If CNNs outperform other neural network architectures in image recognition, then one might presume that they would detect subtle mutations more reliably than FF neural networks, which would force a GA to find even more subtle mutations than would be motivated by an FF network. Our analysis indicates that this presumption is incorrect from respondents’ perspectives as measured by tests of statistical significance.

We performed three analyses to assess the differential effectiveness of an FF network versus CNN, including two hypothesis tests and one regression analysis. We report on the regression analysis first. The regression is based on the proportions reported in Table 5 for the GA parameter sets with an additional column for an indicator variable that was coded with a 1 when an FF network was used and 0 otherwise. The slope obtained from regressing the proportion of responses indicating “quite similar” (dependent variable) against the indicator variable (independent variable) indicates, if statistically significant, the differential increase in proportion for FF relative to the proportion for CNN. The resulting \(p\)-value was 0.97, however, which indicates we cannot conclude that either the FF network or CNN has any advantage over the other.
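A minimal sketch of this indicator regression follows; the file name and dataframe layout are illustrative rather than our actual data export.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical file: one row per GA parameter set with its proportion of
    # "quite similar" responses and the reference network used ("FF" or "CNN").
    df = pd.read_csv("parameter_set_proportions.csv")
    df["is_ff"] = (df["network"] == "FF").astype(int)   # indicator: 1 for FF, 0 for CNN

    # The slope on is_ff estimates the FF-versus-CNN differential in proportion.
    model = smf.ols("proportion ~ is_ff", data=df).fit()
    print(model.summary())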

We also performed more granular hypotheses tests by partitioning the data into 32 pairs where the two parameter sets in each pair differed only in the neural network employed with the fitness function, selection operator, and mutation method being common. For each pair, we tested the FF against the CNN for a difference of proportions of “similar” responses and also tested the raw response data for each of these pairs with a chi square test. Both tests indicated statistically significant differences in 3 of 32 data partitions, where the FF network produced more visually similar results in two of the three statistically significant results. Although this more granular view did detect some differences, it may still be fair to say that the reference neural network has little effect on the visual similarity of AEs that were generated by GAs and, certainly, CNNs do not have the significant advantage that we might hypothesize.

A.1.4 Fitness Function.

We performed regression analyses for each of the eight fitness functions, one at a time, by creating an indicator column in which a 1 was coded in the rows associated with the particular fitness function being tested and 0s in the remaining rows. Statistically significant results indicate that the fitness function coded with 1s has a higher or lower proportion of “similar” responses relative to all remaining fitness functions grouped together. The results in Table A.1 can be classified into three categories: (1) statistically significant with higher proportion of “similar” responses, (2) statistically significant with lower proportion, and (3) no statistically significant difference. The L2 and L1 fitness functions reflected a significantly higher proportion of “similar” responses at the \(\alpha = 0.05\) level, whereas the ML fitness function was marginally significant with a \(p\)-value of 0.065. The proportions of “similar” responses for the Linf and LinfL fitness functions were significantly lower. Descriptive statistics support these results for the best-performing fitness functions: the top 23 parameter combinations employ L2, L1, or ML.

Fitness Function | \(p\)-Value | Differential Proportion
L2 | 0.016 | 0.329
L1 | 0.029 | 0.300
M | 0.032 | 0.2951
L1L | 0.058 | 0.2614
L2L | 0.065 | 0.255
ML | 0.065 | 0.255
LinfL | 0.000 | –0.510
Linf | 0.000 | –0.512

Table A.1. Statistical Significance and Magnitude of Fitness Function Effect on the Proportion of “Similar” Responses

We subsequently tested the six pairwise combinations of the L2, L1, and ML fitness functions for statistically significant differences in their proportions of “similar” responses using the same regression set up with indicator variables while omitting the data relating to the other fitness functions. None of the differences were statistically significant at the \(\alpha = 0.05\) level. Although the performances of these three fitness functions is statistically equivalent, we note that the L2 fitness function performed best with a proportion of “quite similar” that was 0.023 higher than the next best alternative, which is L1.

A.1.5 Selection Operator.

We performed a regression analysis for the selection operator, which is similar to the analysis for the fitness function, as summarized in Table A.2. We subsequently tested for differences in the RNL and RL selection operators because that initial regression indicated that the proportion of “similar” responses for RNL and RL selection was greater than for the remaining pooled methods, with the result for RNL selection being statistically significant and the result for RL selection being marginally significant. The \(p\)-value of this test was 0.552, so we conclude that there is no statistical difference in the performance of these two selection operators. Descriptive statistics support the case for RNL and RL as the best selection operators because they are employed in 14 of the 15 best parameter sets. Although the difference between RNL and RL is not statistically significant, the regression analysis indicated that the proportion of “quite similar” responses was 0.077 greater for RNL.

Selection Operator | \(p\)-Value | Differential Proportion
RNL | 0.008 | 0.325
RL | 0.074 | 0.223
P | 0.000 | −0.411

Table A.2. Statistical Significance and Magnitude of Selection Operator Effect on the Proportion of “Similar” Responses

A.1.6 Mutation Operator.

We performed the same indicator regression analysis to assess the differential performance of random (R) versus MAD_Dist (M) mutation, the results of which indicated that MAD_Dist promoted an increase in “similar” responses by 0.242 with a \(p\)-value of 0.024 over all of the data. An analysis of which mutation operator is best, however, is more nuanced than this initial gross regression analysis might indicate.

Additionally, tests of proportions and chi square tests of the raw responses were made between the two mutation operators while holding constant the remaining parameters. As with the similar analysis done for neural networks, this resulted in 32 hypothesis tests for each of the two methods. The results of 27 of 32 tests of proportions were statistically significant, and the distributions of raw responses were statistically significant in 25 of 32 cases. In the former tests, MAD_Dist was better in 14 instances and random in 13 instances. In the latter tests, MAD_Dist was statistically better in 13 instances and RAND in 12. When MAD_Dist was statistically better in the test of proportions, its proportion of “similar” responses was 0.468 greater, whereas the proportion for random mutation was 0.357 greater when its performance was statistically better. The same was true for the chi square test with the proportion for MAD_Dist mutation being 0.503 greater and the proportion for random mutation being 0.386 greater.

Despite these results seemingly favoring MAD_Dist mutation, we observe that 11 of the 12 best parameter sets in Table 5 employ random mutation. Therefore, the gross regression analysis is skewed toward MAD_Dist mutation because it is, on average, better than random mutation for parameter sets that are not among the best and which would not likely be used. The descriptive statistics here of simply which parameter sets are at the top of the list might therefore offer better guidance than the statistical tests of inference.

A.1.7 Computation Time.

It is certain that GAs are slower than gradient methods, so the point of this section is not to compute a precise difference in execution time between the two methods. Instead, we compute an order-of-magnitude estimate of the difference between the two methods and, more importantly, analyze the variation in computation time due to the GA parameters. Given the goal of making a rough estimate, we executed the two methods on two different sets of hardware, which was convenient because it avoided changing TensorFlow versions for one of the methods.

We were able to generate AEs with the original FGSM and the PGD method for all 60,000 MNIST training images in a range from 3 to 5 minutes running on a single NVIDIA 2080Ti GPU. Our variant of the FGSM took longer due to the need of classifying the candidate AEs, but it still generated 500 AEs in about an hour. In contrast, the execution time for GAs was between a few minutes and more than an hour per completed AE. At best then, GAs are roughly 60,000 times slower than are gradient methods.

The speed of GAs is highly dependent on which neural network is used and also, albeit to a much lesser degree, on the mutation operator. Table A.3 shows that GAs using the FF neural network take on the order of 3 minutes to generate an AE, whereas algorithms using the CNN take more than 1 hour and 40 minutes. With the FF neural network, the distribution of mean execution times for random mutation lies entirely below that for MAD_Dist mutation: random mutation takes approximately 15 fewer seconds with FF neural networks than does MAD_Dist mutation. The execution time distributions overlap with the CNN, so whatever effect the mutation operator may have on execution time, it is possibly obscured by variation in the effect of the CNN on execution time.

Statistic | FF-*-*-R | FF-*-*-M | CNN-*-*-R | CNN-*-*-M
Minimum | 165 | 179 | 6,038 | 6,075
Mean | 169 | 184 | 6,243 | 6,196
Median | 167 | 182 | 6,211 | 6,156
Maximum | 176 | 191 | 6,614 | 6,714

Table A.3. Mean Computation Times in Seconds per AE of GAs for Combinations of Neural Networks and Mutation Operators, Where * Indicates the Parameters over Which the Mean Computation Times Were Computed


A.2 CIFAR-10 Dataset

This section contains details of the statistical analysis for the CIFAR-10 dataset.

A.2.1 Inter-Rater Reliability.

We used Kendall’s W to evaluate the consistency of respondents’ ranking of GA parameter sets and gradient methods as determined by the proportion of “quite similar” responses. The null hypothesis of this nonparametric test is that the rankings of respondents have no correlation. The W statistic is computed and tested with a chi square sampling distribution. That test yielded \(p=0.00,\) so we reject the null hypothesis in favor of the alternate hypothesis that the respondents’ rankings are correlated, which indicates consistency in their visual perception of the relative similarity of AEs generated by the various parameter sets.

A.2.2 Overview of Results and Comparison with Benchmarks.

Figure A.3 shows an uncorrected version of the analysis depicted in Figure 9. Although the HBC method enlarges some equivalence classes, the uncorrected test also indicates that the two best parameter sets lie alone in their equivalence classes and that the two gradient methods lie alone in their equivalence classes with the lowest proportion of “quite similar” responses. Their proportions of “quite similar” responses are statistically different from any GA parameter set and also from each other in both Figures A.3 and 9.

Fig. A.3. Map of statistical significance of the test of proportions among GA parameter sets and gradient methods. Axes’ tick mark labels refer to the indices in the first and fourth columns of Table 7.

A.2.3 Fitness Functions.

We evaluated the effectiveness of the fitness functions using the same approach as for MNIST: testing whether a statistically significant difference exists between the proportion for each fitness function and that of the remaining alternatives pooled together. As with MNIST, we must account for the fact that the L1, L2, and Linf fitness functions are equivalent to their linear forms under the RL and RNL ranking-based selection operators. We do so by attributing the performance of L2-RL, for example, to both the L2 and L2L fitness functions. Table A.4 shows the results in descending order of the differential proportion of “quite similar” responses relative to the proportion for the remaining alternatives pooled. Only L2 and L2L had a positive differential proportion, and only L2 had a \(p\)-value (0.001) below the 0.05 threshold, although the \(p\)-value for L2L is just above that threshold. It is not surprising that L2 and L2L perform similarly, given that two of the three parameter sets for each are shared. Similarly, L1 and L1L occupy adjacent positions in the performance ranking, as do Linf and LinfL. Note that the reciprocal fitness functions always performed better than their linear counterparts, which is due to their better performance with proportional selection.

The observation that L2 and L2L are the best fitness functions raises the question of which of the two is better. We note again that these two fitness functions yield different performance only with proportional selection, L2-P versus L2L-P, because their performance is equivalent with ranking-based selection. Although L2-P and L2L-P could be tested statistically, the point is moot given the superiority of ranking selection over proportional selection, as discussed below.
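To make the pooling concrete, the following sketch carries out a one-versus-rest comparison for a single fitness function, including the attribution of ranking-selection parameter sets to both the reciprocal and linear forms. The counts and dictionaries are placeholders, and our reported results come from a regression formulation, so treat this two-proportion test as a simplified illustration.

    from statsmodels.stats.proportion import proportions_ztest

    # Placeholder survey counts per parameter set: ("quite similar" responses, total responses).
    counts = {"L2-P": (30, 60), "L2L-P": (25, 60), "L2-RL": (40, 60), "L1-P": (20, 60)}
    # Fitness functions credited for each parameter set; L2-RL counts for both L2 and L2L
    # because the reciprocal and linear forms coincide under ranking-based selection.
    credited = {"L2-P": ["L2"], "L2L-P": ["L2L"], "L2-RL": ["L2", "L2L"], "L1-P": ["L1"]}

    def one_vs_rest(fitness):
        in_qs = sum(q for s, (q, n) in counts.items() if fitness in credited[s])
        in_n = sum(n for s, (q, n) in counts.items() if fitness in credited[s])
        out_qs = sum(q for s, (q, n) in counts.items() if fitness not in credited[s])
        out_n = sum(n for s, (q, n) in counts.items() if fitness not in credited[s])
        _, p = proportions_ztest([in_qs, out_qs], [in_n, out_n])
        return (in_qs / in_n) - (out_qs / out_n), p    # differential proportion and its p-value

    diff, p = one_vs_rest("L2")
    print(f"differential proportion = {diff:.3f}, p = {p:.3f}")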

Table A.4.

Fitness Function    \(p\)-Value    Differential Proportion
L2                  0.001           0.099
L2L                 0.091           0.062
L1                  0.930          –0.004
L1L                 0.546          –0.024
Linf                0.186          –0.050
LinfL               0.164          –0.052

Table A.4. Statistical Significance and Magnitude of Fitness Function Effect on the Proportion of “Quite Similar” Responses

A.2.4 Selection Operator.

We performed the same regression analysis as for MNIST to test the differences in proportions of “quite similar” responses between each selection operator and the other two operators combined (see Table A.5). None of the tests was significant at \(\alpha = 0.05\), but the RL operator was the only one with a positive differential proportion relative to the other selection operators combined. Subsequent regression tests of RL versus RNL and of RL versus P showed that RL was expected to have a higher proportion of “quite similar” responses by 0.059 and 0.054, respectively, but the \(p\)-values of 0.254 and 0.219 were not significant. Thus, RL selection yielded the highest overall proportion of “quite similar” responses, although not at the typical threshold of statistical significance. In terms of rank, the three parameter sets that included RL selection were among the best five parameter sets.
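As an illustration of the one-versus-rest regression, a linear probability model with an indicator for RL selection yields the differential proportion directly as the indicator’s coefficient. The toy data and the heteroscedasticity-robust standard errors below are assumptions; our exact model specification may differ.

    import numpy as np
    import statsmodels.api as sm

    # Toy data: y = 1 if a survey response was "quite similar", else 0;
    # is_rl = 1 if the rated AE came from a parameter set using RL selection, else 0.
    rng = np.random.default_rng(0)
    is_rl = rng.integers(0, 2, size=500)
    y = rng.binomial(1, np.where(is_rl == 1, 0.55, 0.50))

    X = sm.add_constant(is_rl.astype(float))
    fit = sm.OLS(y.astype(float), X).fit(cov_type="HC1")
    print(fit.params[1], fit.pvalues[1])   # coefficient on is_rl = differential proportion, and its p-value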

Table A.5.

Selection Operator    \(p\)-Value    Differential Proportion
RL                    0.132           0.056
RNL                   0.548          –0.023
P                     0.469          –0.024

Table A.5. Statistical Significance and Magnitude of Selection Operator Effect on the Proportion of “Quite Similar” Responses for CIFAR-10 Images

A.2.5 Computation Time.

The execution time of the gradient methods was again, as expected, much shorter than that of the GAs. The mean execution times of the PGD method and the FGSM were 0.131 and 0.018 seconds per AE, respectively. Table A.6 shows the mean computation times for the GAs, which are most often on the order of 30 minutes per AE.

Table A.6.

Fitness     Selection    Mean
L1          P            1,689
L1          RL           1,777
L1          RNL          1,791
L1-lin      P            1,769
L2          P              482
L2          RL           1,770
L2          RNL          1,757
L2-lin      P            1,769
Linf        P            1,743
Linf        RL           1,775
Linf        RNL          1,769
Linf-lin    P              288

Table A.6. Mean Computation Times in Seconds per AE of the GA for CIFAR-10 Images


ACKNOWLEDGMENTS

The authors acknowledge William & Mary Research Computing for providing computational resources and/or technical support that have contributed to the results reported within this article. URL: https://www.wm.edu/it/rc.


