Realism and label correctness
While it is desirable that the output of any augmentation method be different from the original data in order to better minimize \(R_{\textrm{vic}}\) ("Method" Section), we want to avoid sampling synthetic points off the original data manifold, thereby also ensuring trustworthy machine learning [90].
Consider the CRESCENTS and the SPIRALS datasets, two 2D synthetic data distributions described in the "Synthetic Data" Section and visualized as “Input” in Fig. 2a. Applying mixup to the CRESCENTS and SPIRALS datasets shows that mixup does not respect the individual class boundaries and synthesizes samples off the data manifold, a phenomenon also known as manifold intrusion [25]. This also results in the generated samples being wrongly labeled, i.e., points in the “red” class’s region being assigned “blue” labels and vice versa, which we term “label error”. On the other hand, \(\zeta \)-mixup preserves the class decision boundaries irrespective of the hyperparameter \(\gamma \) and additionally allows for a controlled interpolation between the original distribution and mixup-like output. With \(\zeta \)-mixup, small values of \(\gamma \) (greater than \(\gamma _{\textrm{min}}\); see Theorem 1) lead to samples being generated further away from the original data, and as \(\gamma \) increases, the resulting distribution approaches the original data.
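For concreteness, the following is a minimal NumPy sketch of this sampling scheme, assuming the formulation of Eqs. 5 and 6: each synthesized sample is a convex combination of all T = m batch samples, with weights drawn from a normalized p-series applied to a random reordering of the batch. The function name and the choice of one fresh permutation per output sample are illustrative, not the reference implementation.

```python
import numpy as np

def zeta_mixup(X, Y, gamma=2.8, rng=None):
    """Synthesize one augmented batch from an original batch (a sketch).

    X: (m, ...) samples, Y: (m, num_classes) one-hot labels.
    Weights follow a normalized p-series, w_i = i^{-gamma} / C (Eqs. 5-6),
    so for gamma >= gamma_min (~1.72865, per Theorem 1) the first weight
    dominates (w_1 > 0.5) and each output keeps a dominant identity.
    """
    rng = np.random.default_rng(rng)
    m = X.shape[0]
    # normalized p-series weights: w_1 >= w_2 >= ... >= w_m, summing to 1
    w = np.arange(1, m + 1, dtype=np.float64) ** -gamma
    w /= w.sum()
    X_out = np.empty_like(X, dtype=np.float64)
    Y_out = np.empty_like(Y, dtype=np.float64)
    for k in range(m):
        perm = rng.permutation(m)  # random ordering of the batch
        X_out[k] = np.tensordot(w, X[perm], axes=1)  # convex combination
        Y_out[k] = w @ Y[perm]  # the same weights mix the soft labels
    return X_out, Y_out
```

For \(\gamma \) just above \(\gamma _{\textrm{min}}\), the dominant weight is close to 0.5 and the synthesized points stray farther from their dominant sample; as \(\gamma \) grows, \(w_1 \rightarrow 1\) and the output approaches the original batch, matching the behavior described above.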
Applying mixup in 3D space (Fig. 2b) results in a somewhat extreme case of the generated points being sampled off the data manifold, filling up the entire hollow region in between the helical distribution. \(\zeta \)-mixup, however, similar to Fig. 2a, generates points that are relatively much closer to the original points, and increasing \(\gamma \) to a large value, say \(\gamma =6.0\), leads the generated samples to lie almost perfectly on the original data manifold.
Moving on to higher dimensions with the MNIST data, i.e., 784-D, we observe that the problems with mixup’s output are even more severe and that the improvements from using \(\zeta \)-mixup are more conspicuous. For each digit class in the MNIST dataset, we take the first 10 samples as shown in Fig. 3a and use mixup and \(\zeta \)-mixup to generate 100 new images each (Fig. 3b, c). It is easy to see that the digits in \(\zeta \)-mixup’s output are more discernible than those in mixup’s output.
Finally, to analyze the correctness of probabilistic labels in the outputs of mixup and \(\zeta \)-mixup, we pick 4 samples each from the respective outputs and inspect their probabilistic soft labels. mixup’s outputs (Fig. 3d) all look like images of handwritten “8”. The soft label of the first digit in Fig. 3d is [0, 0.53, 0, 0, 0, 0.47, 0, 0, 0, 0], where the \(i^{\textrm{th}}\) index is the probability of the \(i^{\textrm{th}}\) digit, implying that this output has been obtained by mixing images of the digits “1” and “5”. Interestingly, the resulting output looks like neither a “1” nor a “5”, and the digit “8” is not one of the classes used as input for this image. That is, with mixup, there is a disagreement between the appearance of the synthesized image and its assigned label. Similar label errors exist in the other images in Fig. 3d. On the other hand, there is a clear agreement between the images produced by \(\zeta \)-mixup and the labels assigned to them (Fig. 3e).
Next, we set out to quantify (i) the realism and (ii) the label correctness of mixup- and \(\zeta \)-mixup-synthesized images. To this end, we assume access to an Oracle that can recognize MNIST digits. For (i), we hypothesize that the more realistic an image is, the more certain the Oracle will be about the digit in it, and vice versa. For example, although the first image in Fig. 3d is a combination of a “1” and a “5”, the resulting image looks very similar to a realistic handwritten “8”. On the other hand, consider the highlighted and zoomed digits in Fig. 3b. To an Oracle, images like these are ambiguous and do not belong to one particular class. Consequently, the uncertainty of the Oracle’s prediction will be high. We therefore adopt the Oracle’s entropy (\({\mathcal {H}}\)) as a proxy for realism. For (ii), we use the cross entropy (CE) to compare the soft labels assigned by either mixup or \(\zeta \)-mixup to the label assigned by the Oracle. For example, if the resulting digit in a synthesized image is deemed an “8” by the Oracle and the label assigned to the sample, by mixup or \(\zeta \)-mixup, is also “8”, then the CE is low and the label is correct. We also note that for the Oracle, the certainty of the predictions is correlated with the correctness of the labels. Finally, to address the issue of which Oracle to use, we adopt a highly accurate LeNet-5 [26] MNIST digit classifier that achieves \(99.31\%\) classification accuracy on the standardized MNIST test set.
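Both quantities are straightforward to compute. A short sketch of our own follows, assuming the Oracle exposes softmax probabilities and treating the Oracle's prediction as the reference distribution in the cross entropy (the argument order is left implicit in the text):

```python
import numpy as np

def oracle_metrics(oracle_probs, soft_labels, eps=1e-12):
    """Realism proxy and label correctness for synthesized samples.

    oracle_probs: (N, 10) softmax outputs of the Oracle (e.g., LeNet-5)
    soft_labels:  (N, 10) probabilistic labels assigned by the augmenter
    """
    p = np.clip(oracle_probs, eps, 1.0)
    # (i) realism: entropy H of the Oracle's prediction
    #     (lower H = more confident Oracle = more realistic image)
    entropy = -(p * np.log(p)).sum(axis=1)
    # (ii) label correctness: cross entropy between the assigned soft
    #      label and the Oracle's prediction (lower CE = correct label)
    cross_entropy = -(soft_labels * np.log(p)).sum(axis=1)
    return entropy, cross_entropy
```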
Figure 3f, g show the quantitative results for the realism (\(\propto \) 1/\({\mathcal {H}}\)) of mixup’s and \(\zeta \)-mixup’s outputs and for the correctness of the corresponding labels (\(\propto \) 1/CE) as evaluated by the Oracle, respectively, using kernel density estimate (KDE) plots with normalized areas. For both metrics, lower values (along the horizontal axes) are better. In Fig. 3f, we observe that \(\zeta \)-mixup has a higher peak at low values of entropy as compared to mixup, indicating that the former generates more realistic samples. The inset figure therein shows the same plot with a logarithmic scale for the density, and \(\zeta \)-mixup’s improvements over mixup for higher values of entropy are clearly discernible there. Similarly, in Fig. 3g, we see that the cross entropy values for \(\zeta \)-mixup are concentrated around 0, whereas those for mixup are spread out more widely, implying that the former produces fewer samples with label error. If we restrict our samples to only those for which the entropy of the Oracle’s predictions is less than 0.1, i.e., highly realistic samples, the label correctness distribution remains similar, as shown in the inset figure; in other words, mixup’s outputs that look realistic are more likely to exhibit label error.
Note that similar problems with unrealistic synthesized images exist for skin lesion images, as shown in the outputs of mixup applied to 100 samples from the ISIC 2017 (Fig. 4) and ISIC 2018 (Fig. 5) datasets. mixup generates images that contain (1) overlapping lesions with different diagnoses, (2) artifacts (dark corners, stickers, ink markers, hair, etc.) overlapping the lesion, or (3) unrealistic anatomical arrangements, such as a lesion or hair appearing outside the body. While \(\zeta \)-mixup’s outputs exhibit a higher degree of realism compared to those of mixup, we acknowledge that it is difficult to accurately estimate the realism of medical images without expert assessment.
Preserving the intrinsic dimensionality of the original data
As a direct consequence of the realism of synthetic data discussed above and its relation to the data manifold, we evaluate how the intrinsic dimensionality (ID hereafter) of the datasets changes when mixup and \(\zeta \)-mixup are applied.
According to the manifold hypothesis, the probability mass of high-dimensional data such as images, speech, and text is highly concentrated, and optimization problems in such high dimensions can be solved by fitting low-dimensional non-linear manifolds to points from the original high-dimensional space, an approach known as manifold learning [53, 54, 59]. This idea that real-world image datasets can be described by representations of considerably fewer dimensions [91], known as the intrinsic dimensionality, has fuelled research into lower-dimensional representation learning techniques such as autoencoders [92, 93]. Moreover, recent research has concluded that deep learning models are easier to train on datasets with low intrinsic dimensionalities and that such models exhibit better generalization performance [45].
While the ID of a dataset can be estimated globally, datasets can be heterogeneous and thus consist of regions of varying IDs. As such, instead of a global estimate of the ID, a local measure of the ID (local ID hereafter), estimated in the local neighborhood of each point in the dataset with neighborhoods typically defined using the k-nearest neighbors, is more informative of the inherent organization of the dataset. For our local ID estimation experiments, we use a principal component analysis-based local ID estimator from the scikit-dimension Python library [94] with the Fukunaga-Olsen method [95], where an eigenvalue is considered significant if it is larger than \(5\%\) of the largest eigenvalue.
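A sketch of this estimation is shown below, assuming scikit-dimension's lPCA estimator with the Fukunaga-Olsen criterion (we take the ver="FO" and alphaFO=0.05 arguments to correspond to the 5% eigenvalue threshold) and its pointwise fit_pw/dimension_pw_ interface; the random matrix is a stand-in for real data:

```python
import numpy as np
import skdim

# stand-in for a real data matrix, e.g., CIFAR images flattened to 3072-D
X = np.random.default_rng(0).random((500, 3072))

# PCA-based local ID with the Fukunaga-Olsen significance criterion:
# an eigenvalue counts if it exceeds alphaFO * (largest eigenvalue)
estimator = skdim.id.lPCA(ver="FO", alphaFO=0.05)

# pointwise (local) ID over k-nearest-neighbor neighborhoods
for k in (8, 128):
    estimator.fit_pw(X, n_neighbors=k)
    lid = estimator.dimension_pw_  # one local ID estimate per point
    print(f"k={k}: local ID = {lid.mean():.2f} +/- {lid.std():.2f}")
```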
With our 3D manifold visualizations in Fig. 2b, we saw that mixup samples points off the data manifold while \(\zeta \)-mixup limits the exploration of the high-dimensional space, thus maintaining a lower ID. To substantiate this claim with quantitative results, we estimate the IDs of several datasets, both synthetic and real-world, and compare the IDs of the mixup- and \(\zeta \)-mixup-generated distributions to those of the respective original distributions. For synthetic data, we use the high-dimensional datasets described in the "Synthetic data" Section, i.e., 1-D helical manifolds embedded in \(\mathbb {R}^3\) and in \(\mathbb {R}^{12}\). For real-world datasets, we use the entire training partitions (50,000 images each) of the CIFAR-10 and CIFAR-100 datasets.
For each point in all four datasets, the local ID is calculated using a k-nearest neighborhood around each point with \(k=8\) and \(k=128\) [94, 95]. The means and the standard deviations of the local ID estimates for all the datasets, i.e., the original data distribution, mixup’s output, and \(\zeta \)-mixup’s outputs for \(\gamma \in [0, 15]\), are visualized in Fig. 6.
The results in Fig. 6 support the observations from the discussions of the realism ("Realism and Label Correctness" Section) and the diversity ("Diversity" Section) of the outputs. In particular, notice how mixup’s off-manifold sampling leads to an inflated estimate of the local ID, whereas the local ID of \(\zeta \)-mixup’s output is lower than that of mixup and, as expected, can be controlled using \(\gamma \). This difference is even more apparent with the real-world high-dimensional (3072-D) datasets, i.e., CIFAR-10 and CIFAR-100, where, for all values of \(\gamma \ge \gamma _{\textrm{min}}\) (Theorem 1), as \(\gamma \) increases, the local ID of \(\zeta \)-mixup’s output drops dramatically, meaning the resulting distributions lie on progressively lower dimensional intrinsic manifolds.
We note, however, that for some datasets, when employing large values of \(\gamma \), the local ID of \(\zeta \)-mixup’s outputs may be lower than the local ID of the original dataset (Fig. 6). Since we use the same number of nearest neighbors (\(n_{\textrm{NN}} = \{8, 128\}\)) across all methods to perform the PCA-based local ID estimation [95], higher values of \(\gamma \) lead to the synthesized samples being closer to each other and the distribution of the resulting augmented samples being more compact than the original dataset (“vanilla” in Fig. 6). Figure 7 shows a visual explanation for this: consider a synthetic two-class 2D data distribution and its mixup- and \(\zeta \)-mixup-augmented outputs (Fig. 7a–c, respectively). If we were to estimate the local ID for this data without any augmentation (Fig. 7d), the samples are comparatively more spread out than \(\zeta \)-mixup’s outputs (Fig. 7e). If we were to fit an ellipse (representing the covariance of the data or the result of PCA) to estimate the local ID, notice how \(\zeta \)-mixup’s more compact distribution leads to an ellipse with higher eccentricity than the one for the original distribution.
Evaluation on downstream task: classification
We compare the classification performance of models trained using traditional data augmentation techniques, e.g., rotation, horizontal and vertical flipping, and cropping (“ERM”), against those trained with mixup’s and \(\zeta \)-mixup’s outputs. Additionally, we evaluate whether there are performance improvements when \(\zeta \)-mixup is applied in conjunction with an orthogonal augmentation technique, CutMix.
We do not compare against optimization-based mixing methods (e.g., Co-Mixup [96]), which, while conceptually orthogonal to \(\zeta \)-mixup and potentially complementary, involve the use of combinatorial optimization and specialized libraries.¹ These methods, by design, introduce a significant computational overhead that places the burden of image understanding on the data augmentation process. This increased computational cost is evident in model training times. For instance, CIFAR-100 models trained using mixup, \(\zeta \)-mixup, CutMix, and even the combination of CutMix and \(\zeta \)-mixup take almost the same time as ERM (approximately 1 h 20 min; Table 9). On the other hand, Co-Mixup, due to its reliance on optimization, requires training times that are over an order of magnitude larger (over 16 h; similar to the training time in the official repository’s training log²). We also refrain from extensive comparisons against methods that interpolate in the latent space (e.g., manifold mixup [41]) for two main reasons. First, the computational demands associated with these methods are considerably higher: while ERM, mixup, and \(\zeta \)-mixup models trained on CIFAR-100 converge in a reasonable amount of time, typically within 200 epochs and approximately 1 h, training a model with manifold mixup extends to 2000 epochs and requires over 16 h (Table 9). Second, the theoretical justifications associated with such methods are not unanimously agreed upon [97]. Nevertheless, despite this considerably higher computational burden, we compare manifold mixup to \(\zeta \)-mixup on nine diverse natural and medical image classification datasets.
Table 4 presents the quantitative evaluation for the natural image datasets. For all our experiments with mixup, we use the official implementation by the authors.³ mixup samples its interpolation factor \(\lambda \) from a Beta(\(\alpha , \alpha \)) distribution, and following the original mixup paper [36], its code implementation,⁴ as well as several other works [39, 42, 44, 98‐100], we set \(\alpha = 1\), which results in \(\lambda \) being sampled from a \(\textrm{U}[0, 1]\) uniform distribution. For all our experiments with \(\zeta \)-mixup, we synthesize new training samples through convex combinations (Eqs. 5, 6) of all the samples in a training batch, i.e., T (the number of samples used for interpolation) \(= m\) (the number of samples in a training batch). For comparison against the mixup-based models, we choose 3 values of \(\gamma \) for the corresponding \(\zeta \)-mixup-based models:
- \(\gamma =2.4\): to allow exploration of the space around the original data manifold,
- \(\gamma =4.0\): to restrict the synthetic samples to be close to the original samples, and
- \(\gamma =2.8\): to allow for a behavior that permits exploration while still restricting the points to a small region around the original distribution.
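For reference, the two interpolation schemes compared here differ as follows: mixup mixes pairs of samples with a single scalar \(\lambda \), whereas \(\zeta \)-mixup applies the full-batch combination sketched earlier, controlled by \(\gamma \). A minimal sketch of the former, following the logic of the official implementation but mixing one-hot labels directly (equivalent to the original loss-mixing form under cross entropy), might look like this:

```python
import numpy as np
import torch

def mixup_batch(x, y, alpha=1.0):
    """Pairwise mixup of a batch (a sketch).

    x: (m, C, H, W) images, y: (m, num_classes) one-hot labels
    """
    lam = np.random.beta(alpha, alpha)  # alpha = 1 -> lam ~ U[0, 1]
    index = torch.randperm(x.size(0))   # pair each sample with another
    x_mixed = lam * x + (1 - lam) * x[index]
    y_mixed = lam * y + (1 - lam) * y[index]
    return x_mixed, y_mixed
```

Note that mixup draws one scalar \(\lambda \) per batch, while \(\zeta \)-mixup's weights form a full probability vector over the batch whose shape is governed by \(\gamma \).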
We see that 17 of the 18 models in Table 4 trained with \(\zeta \)-mixup outperform their ERM and mixup counterparts, with the lone exception being a model that is as accurate as mixup. We also observe a performance improvement when \(\zeta \)-mixup is applied along with CutMix, as shown in Table 5. To show that the performance gains from \(\zeta \)-mixup are achievable for all reasonable values of \(\gamma \), for these experiments, we sample a new \(\gamma \sim \textrm{U}[1.72865, 4.0]\) for each mini-batch.
Next, Table 6 shows the performance of the models on the 10 skin lesion image diagnosis datasets (\(\gamma =\{2.4, 2.8, 4.0\}\)). For both ResNet-18 and ResNet-50 and for all the 10 SKIN datasets, \(\zeta \)-mixup outperforms both mixup and ERM on the skin lesion diagnosis tasks. Finally, Table 7 presents the quantitative evaluation on the 10 classification datasets from the MedMNIST collection, where we use \(\zeta \)-mixup only with \(\gamma =2.8\). On 8 out of the 10 datasets, \(\zeta \)-mixup outperforms both mixup and ERM, and on each of the other 2, \(\zeta \)-mixup achieves the highest value for 1 of the 2 metrics.
Note that these selected values of \(\gamma \) can be changed to other reasonable values (see the "\(\zeta \)-mixup: hyperparameter sensitivity analysis and ablation study" Section for a sensitivity analysis of \(\gamma \)), and, as shown above qualitatively and quantitatively, the desirable properties of \(\zeta \)-mixup hold for all values of \(\gamma \ge \gamma _{\textrm{min}}\). Consequently, our quantitative results on classification tasks on 26 datasets show that \(\zeta \)-mixup outperforms ERM and mixup on all the datasets and, in most cases, with all three selected values of \(\gamma \).
For a more intuitive explanation of how \(\zeta \)-mixup leads to superior performance, let us revisit the synthetic data distribution in Fig. 7, now with a test sample (denoted by a green square). With mixup, the test sample may lie in the vicinity of incorrectly labeled mixup-augmented training samples. We study the classes of the samples in the vicinity of a test sample using its k-nearest neighbors, \(k = \{8, 16\}\). Such errors, i.e., a test sample falling in the vicinity of training samples of a different class and thus being misclassified, are less likely with \(\zeta \)-mixup since it generates training samples that are closer to the original data distribution.
This can also be observed on real-world datasets. We choose two skin lesion image datasets from our experiments, spanning two imaging modalities, and two model architectures for our analysis: the ResNet-50 model trained on ISIC 2017 (dermoscopic images) and the ResNet-18 model trained on derm7point: Clinical (clinical images). Figure 8a shows 14 sample images from the test sets of each of the two datasets that were misclassified by both ERM and mixup but were correctly classified by \(\zeta \)-mixup for all values of \(\gamma \) (Table 6). To study the distribution of training samples and their labels in the vicinity of these test images, we perform the following analysis: for both models, we generate mixup- and \(\zeta \)-mixup-synthesized training samples and compute their features using the pre-trained classification models. This results in 2048-dimensional and 512-dimensional feature vectors for ISIC 2017 (ResNet-50) and derm7point (ResNet-18), respectively. For 12 of the 14 test images from derm7point (Fig. 8a), there were more training samples with correct labels in the vicinity of the test samples (measured using the 128-nearest neighbors in the 512-dimensional feature space) for the \(\zeta \)-mixup-trained model than for the mixup-trained model. Overall, the number of correctly labeled nearest-neighbor training samples was \(208.2\%\) higher for \(\zeta \)-mixup than for mixup. The corresponding numbers for ISIC 2017 (2048-dimensional feature space) were 14 out of 14 test samples and \(1908.8\%\) more correctly labeled nearest-neighbor training samples. The distances to the nearest neighbors were calculated using cosine similarity.
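A sketch of this neighborhood analysis, assuming the features have already been extracted from the trained models; the function and variable names are ours:

```python
import numpy as np

def correct_neighbor_count(train_feats, train_labels, test_feat,
                           test_label, k=128):
    """Count correctly labeled training samples among the k nearest
    neighbors of one test feature, ranked by cosine similarity.

    train_feats: (N, D) feature vectors of augmented training samples
    train_labels: (N,) their (hard) class labels
    test_feat: (D,) feature vector of one test image
    """
    # cosine similarity = dot product of L2-normalized vectors
    a = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    b = test_feat / np.linalg.norm(test_feat)
    sims = a @ b
    nn_idx = np.argsort(-sims)[:k]  # k most similar training samples
    return int((train_labels[nn_idx] == test_label).sum())
```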
Next, we project these features onto a 2D embedding space using t-distributed Stochastic Neighbor Embedding (t-SNE) [101] with the openTSNE Python library [102], representing each training sample’s feature by a class color-coded circle. Finally, we project the test samples’ features onto the same embedding spaces, denoted by squares. It should be noted that this t-SNE representation drastically reduces the dimensionality of the features (\(\{512, 2048\}\)-D \(\rightarrow 2\)-D), causing some information loss. We observe that with mixup (Fig. 8b, d), several test samples fall in the vicinity of training samples of a class different from the correct class of the test sample, potentially leading to misclassification. Examples of this include a ‘NEV’ misclassified as ‘MEL’, a ‘NEV’ misclassified as ‘SK’, and an ‘SK’ misclassified as ‘NEV’ in Fig. 8b, and a ‘NEV’ misclassified as ‘MEL’ and a ‘MISC’ misclassified as ‘MEL’ in Fig. 8d. With \(\zeta \)-mixup, on the other hand, these test samples are less likely to have training images of a different class in their vicinity (Fig. 8c, e).
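Projecting new points into an already-fitted embedding is precisely what openTSNE provides; a minimal sketch, with random features as stand-ins for the classifier features:

```python
import numpy as np
from openTSNE import TSNE

# stand-ins for features from the trained classifier
train_feats = np.random.default_rng(0).random((2000, 512))
test_feats = np.random.default_rng(1).random((14, 512))

# fit the 2D embedding on the (augmented) training features
embedding = TSNE(n_components=2, random_state=0).fit(train_feats)

# project the test features into the same embedding space
# (the squares overlaid on the circles in Fig. 8)
test_embedding = embedding.transform(test_feats)
```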
Finally, we also compare \(\zeta \)-mixup to the computationally intensive manifold mixup. As mentioned above, manifold mixup requires an order of magnitude more epochs for convergence. For instance, while ERM, mixup, and \(\zeta \)-mixup all require 200 epochs, manifold mixup is trained for 2000 epochs [41]. However, in an effort to understand the performance gains obtained from such a massive computational requirement, we evaluate manifold mixup on 9 datasets: we choose 2 datasets from NATURAL (CIFAR-10, CIFAR-100), 3 datasets from MEDMNIST (BreastMNIST, PathMNIST, TissueMNIST), and 4 datasets from SKIN (derm7point: Clinical, MSK, ISIC 2017, DermoFit), thus covering natural and medical image datasets of various resolutions (\(28 \times 28\), \(32 \times 32\), \(224 \times 224\)), multiple medical imaging modalities (dermoscopic and clinical skin images, ultrasound images, histopathology images, microscopic images), image types (BreastMNIST and TissueMNIST are grayscale while the others are RGB), and model architectures (ResNet-18, ResNet-50). For CIFAR-10 and CIFAR-100, we follow the experimental settings of Verma et al. [41], and since they did not perform experiments on our other datasets, we scale the corresponding experimental settings (i.e., the number of training epochs and the learning rate scheduler milestones) accordingly. Therefore, for the 3 MEDMNIST datasets, the manifold mixup-augmented classification models are trained for 1,000 epochs with a learning rate of 0.01. For the 4 SKIN datasets, the manifold mixup models are trained for 500 epochs with an initial learning rate of 0.01, decayed by a multiplicative factor of 0.1 every 100 epochs. The quantitative results for all metrics on all 9 datasets are visualized in Fig. 9. On 2 datasets, manifold mixup outperforms \(\zeta \)-mixup, and on 3 datasets, manifold mixup achieves a superior value on one metric compared to \(\zeta \)-mixup. However, on the remaining 4 datasets, \(\zeta \)-mixup outperforms manifold mixup across all metrics. Therefore, despite being considerably more computationally intensive (each manifold mixup model is trained for \(10\times \) the number of epochs compared to a \(\zeta \)-mixup model trained on the same dataset), manifold mixup-trained models do not demonstrate a clear and consistent performance improvement over the comparatively more efficient \(\zeta \)-mixup.
\(\zeta \)-mixup: hyperparameter sensitivity analysis and ablation study
We conduct extensive experiments on the CIFAR-10 and CIFAR-100 datasets to analyze the effect of \(\zeta \)-mixup’s hyperparameter \(\gamma \) on the performance of \(\zeta \)-mixup, and we also analyze how the weight decay of the SGD-based optimization affects model performance.
First, we vary the hyperparameter \(\gamma \), choosing values from [1.8, 2.0, 2.2, \(\cdots \), 5.0], and train and evaluate ResNet-18 models on CIFAR-10 and CIFAR-100. The corresponding overall error rates (ERR) are shown in Fig. 10a, b, respectively. We observe that for almost all values of \(\gamma \), \(\zeta \)-mixup achieves a lower or equal ERR than mixup, supporting the claim from our results on 26 datasets that performance gains with \(\zeta \)-mixup are achievable for all values of \(\gamma \ge \gamma _{\textrm{min}}\).
To further understand the effect of \(\zeta \)-mixup augmentation on model optimization in the presence of weight decay, we perform another extensive hyperparameter study: we observe model performance while varying both \(\gamma \) and the weight decay (\(L_2\) penalty) of SGD. We sample the hyperparameter \(\gamma \) from a uniform distribution over [1.0, 6.0] and the weight decay from a log-uniform distribution over \([5 \times 10^{-5}, 10^{-3}]\), and use Weights and Biases [103] to perform a Bayesian search [104‐107] in this space. We train and evaluate ResNet-18 models on the CIFAR-10 and CIFAR-100 datasets. For each of the two datasets, we train 200 models, each optimized with a different combination of \(\gamma \) and weight decay. To visualize the results, we plot three values, \(\gamma \), weight decay, and the final test accuracy of the resulting model, using parallel coordinates plots [108, 109] (Fig. 10c, d). Models trained with \(\gamma < \gamma _{\textrm{min}}\) are shown in light gray.
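A sketch of the corresponding sweep configuration using the standard Weights and Biases sweeps API; the metric, project, and function names are ours, and the training body is elided:

```python
import wandb

sweep_config = {
    "method": "bayes",  # Bayesian search over the two hyperparameters
    "metric": {"name": "test_accuracy", "goal": "maximize"},
    "parameters": {
        "gamma": {"distribution": "uniform", "min": 1.0, "max": 6.0},
        "weight_decay": {
            # log-uniform over [5e-5, 1e-3]
            "distribution": "log_uniform_values",
            "min": 5e-5,
            "max": 1e-3,
        },
    },
}

def train():
    run = wandb.init()
    gamma = run.config.gamma          # passed to zeta-mixup
    wd = run.config.weight_decay      # passed to the SGD optimizer
    # ... train a ResNet-18, then report the final test accuracy:
    run.log({"test_accuracy": 0.0})   # placeholder for the real value

sweep_id = wandb.sweep(sweep_config, project="zeta-mixup-sweep")
wandb.agent(sweep_id, function=train, count=200)  # 200 models per dataset
```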
The parallel coordinates plots can be read by following a curve through the 3 columns, where each curve denotes an experiment with the values of, in left-to-right order, \(\gamma \), weight decay, and test accuracy. In all columns, a lighter color indicates a higher value. We observe that the best performing models (i.e., the curves with the lightest shades of yellow) emanate from smaller values of \(\gamma \) (approximately in the range \([1.72865, 4.0]\)) and larger weight decays (approximately in the range \([5 \times 10^{-4}, 10^{-3}]\)). On the other hand, larger values of \(\gamma \), which lead to data distributions similar to the “vanilla” distribution (Fig. 2a), yield lower classification accuracies (i.e., the curves with dark purple colors), validating our hypothesis that such augmented samples do not considerably explore the space around the original samples and therefore provide a weaker augmentation.
Finally, to understand the individual contribution of each of the two components of \(\zeta \)-mixup towards its superior performance, namely the mixing of all the samples in a batch (i.e., \(T=m\) original samples; Eq. 5) and the sampling of weights for the original samples from a normalized p-series (Eq. 6), we perform the following ablation study. We train models with one of these components removed at a time and study the effect on the downstream classification performance. For this, we use the CIFAR-100 dataset because of its large number of classes (100) and use the experimental settings from the "Evaluation on downstream task: classification" Section and Table 4: a ResNet-18 architecture trained for 200 epochs with an initial learning rate of 0.1 decayed by a multiplicative factor of 0.2 at 80, 120, and 160 epochs, \(\gamma = 2.8\), and \(m=128\). The quantitative results for this ablation study are presented in Table 8. To begin with, note that mixup is a special case of \(\zeta \)-mixup (Theorem 2), where the former uses neither of the aforementioned components. Then, we modify mixup to mix samples using the proposed weighting scheme (Eq. 6) while retaining mixup’s choice of mixing only 2 samples. This results in an improved performance over mixup. For the next experiment, we mix the entire batch (i.e., \(T=m\)) but with weights sampled from a Dirichlet distribution \(\textrm{Dir} (\varvec{\alpha })\) with \(\varvec{\alpha } = [1.0, 1.0, \cdots , 1.0]\), since this is a multivariate generalization of the Beta(1.0, 1.0) distribution used to sample weights for mixup. Unsurprisingly, we observe that mixing a large number of samples (\(m=128\)) with a weighting scheme that does not assign a large weight to a single sample results in very poor performance. Such a weighting scheme violates one of the desirable properties of an ideal augmentation method ("\(\zeta \)-mixup Formulation" Section), since the synthesized samples will be generated away from the original samples, leaving the original data manifold (Fig. 1), and will therefore exhibit a higher local intrinsic dimensionality (Fig. 6) and lower realism. Finally, \(\zeta \)-mixup, which uses both of these components, outperforms all these ablated variants.
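The contrast between the two weighting schemes is easy to see numerically; a small illustration of our own, comparing the largest mixing weight under each scheme for a batch of \(m=128\):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 128

# (a) Dirichlet weights, the multivariate analogue of Beta(1, 1):
#     no single sample dominates, so the mixture drifts off-manifold
w_dir = rng.dirichlet(np.ones(m))

# (b) normalized p-series weights used by zeta-mixup (gamma = 2.8):
#     the first sample keeps a dominant weight, w_1 > 0.5
w_ps = np.arange(1, m + 1, dtype=float) ** -2.8
w_ps /= w_ps.sum()

print(f"max Dirichlet weight: {w_dir.max():.3f}")  # ~0.04 on average
print(f"max p-series weight:  {w_ps[0]:.3f}")      # ~0.80
```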