Introduction
-
What are the various types of learning techniques utilized in DL, and how do they differ in their effectiveness in addressing the challenges of data scarcity?
-
What are various DL architectures?
-
What are the most effective solutions to address the issue of data scarcity in DL, and how do these solutions, such as transfer learning and generative models, perform in comparison to traditional data augmentation techniques in various applications such as image classification, natural language processing, and speech recognition?
-
How can the use of the listed solutions to address limited training data in DL be applied to various sub-applications, and what are the challenges and potential solutions for collecting new data in these areas?
-
What are the most effective pre-training and testing tips for utilizing datasets in DL, and how do they impact the accuracy and efficiency of DL models?
-
What are the best practices and guidelines for reporting datasets used in DL, and how can they improve the reproducibility, transparency, and reliability of DL research?
-
How can trustworthy training datasets be defined, identified, and evaluated for use in DL, and what are the implications of using such datasets on the accuracy, fairness, and ethical considerations of DL models?
-
To the best of our knowledge, this is the first comprehensive review that studies the importance and the main aspects of training data for DL.
-
Learning techniques and DL architectures are explained in detail.
-
Several approaches dealing with data scarcity are accordingly introduced including Transfer Learning (TL), Self-supervised learning (SSL), Generative Adversarial Networks (GANs), and model architecture. Furthermore, alternatives that help to deal with the lack of training data are reviewed, including the concepts of a Physics Informed Neural Network (PINN) and DeepSMOTE.
-
Several tips about the data are provided that should be considered before training DL models. These tips help researchers achieve a full understanding of what they need to know before progressing to any further training stage.
-
It provides a list of typical applications in which DL has been less explored regarding how to deal with data scarcity. An analysis of why those applications did not carry out a suitable study of data for training is also given. Typical applications include electromagnetic imaging, civil structural health monitoring, meteorology, medical imaging, wireless communications, fluid mechanics, microelectromechanical systems, and cybersecurity. Moreover, different alternatives are provided in order to tackle the data scarcity issue in a more suitable manner.
-
This review offers suggestions regarding how to properly report the dataset when using DL.
-
Finally, the key requirements for a trustworthy training dataset for DL have been discussed.
Survey methodology
Types of learning
Learning problems
Hybrid learning problems
Statistical inference
Learning techniques
Deep learning architectures
Deep neural network (DNN)
Convolutional neural network (CNN)
Recurrent neural network (RNN)
Deep autoencoder network (DAN)
Deep belief network (DBN)
Deep Boltzmann machine (DBM)
Deep convolutional extreme learning machine (DC-ELM)
Deep stacking networks (DSN)
Long short-term memory/gated recurrent unit networks (LSTM/GRU)
Graph convolutional network (GCN)
Lack of training data: issues and solutions
Transfer learning (TL)
-
What is transfer learning?

When applied in DL, TL denotes the reuse of existing models to address a new problem. Rather than being a typical DL algorithm, TL recycles knowledge gained in prior training to train a new model: features selected for the previously trained activity are reused to classify data in the new task. The initially trained model needs high-level generalization so that it can adapt to new data [128, 129, 219]. With TL, training does not begin from scratch for each new task. Classifying massive datasets is time-consuming, especially when a DL algorithm is applied; thus, a DL model trained with TL on a classified dataset at hand can be reused for the same task involving unclassified data. For example:
-
Riding a motorcycle \(\Rightarrow\) Driving a motorcar.
-
Playing a classic guitar \(\Rightarrow\) Playing the bass guitar.
-
Learning mathematics and ML \(\Rightarrow\) Learning DL.
-
-
What is transfer learning used for?

TL is used in DL to train a system to solve new tasks without requiring massive resources: certain relevant fractions of an existing DL model are reused to address a new but similar problem. Generalization is integral to TL, as only knowledge that remains valid for another model in other settings can be transferred. Because models built with TL are more general and not rigidly tied to particular training data, they may be applied to varying datasets and scenarios [220]. Take image categorization as an example: identifying and categorizing images can be done using DL, and with TL the model may be reused to detect other specific objects within images. Resources are saved because the primary aspects, such as determining object edges in images, are retained; this knowledge transfer avoids re-training a model to obtain similar output. Hence, TL is mostly applied for the following:
-
Saving resources and time, since training DL models need not begin from scratch for the same task.
-
Overcoming inadequate training data, since TL permits the use of a pre-trained model.
-
-
How does transfer learning work?

When TL is used in DL, fractions of a pre-trained DL model are reused for a new but similar problem, or certain new elements are incorporated into the model to address a specific task. The programmer determines and retains the model parts relevant to the new task. If object detection is the task of a new model, a pre-trained model for that very similar task may be applied [221, 222]. Supervised DL models are trained to execute certain tasks from classified data: only after feeding the input and desired output data to the algorithm can the model recognize patterns and learn trends in the dataset. Such a model yields accurate output within a similar setting, but its accuracy may suffer if the setting changes beyond the training dataset. TL addresses this issue by transferring the relevant knowledge from an existing model to a new model with the same task. Transferring the general aspects of the model is crucial for task completion so that the desired output is identified. Tasks can be performed optimally in a new setting when additional layers of specific knowledge are included in the new model [223‐225].
-
Benefits of transfer learning for DL

Notably, TL offers many advantages when training new DL models [23, 127]. TL facilitates model training using unclassified data, since a pre-trained model is reused. Some of the benefits are:
-
Dismissing the need for a huge set of classified training data for the new model
-
Enhancing the efficiency of developing and deploying the DL for multiple models
-
Leveraging algorithms to resolve new problems and offering generality in problem solving
-
Using simulation for model training rather than actual data
The details of the benefits are:

1. Saving on training data. A massive amount of data is needed to train a DL algorithm accurately, and classified training data consume much time, expertise, and effort to create. In TL, pre-trained models are deployed, which minimizes the amount of data needed for new DL models: training in the TL approach uses existing classified data, and the model is later deployed for similar but unclassified data.

2. Efficient training of multiple models. Properly training DL models to execute intricate tasks can be time-consuming. However, TL removes the need to start from scratch when a similar model is needed, so the time, effort, and resources spent on training a DL algorithm can be reused for other models. The reuse of similar aspects and the transfer of knowledge from a prior model ensure an efficient training process.

3. Leveraging knowledge to solve new challenges. Supervised DL offers high accuracy after receiving adequate training with classified data, but its performance may degrade when the data deviate. TL applies existing models to execute a similar task instead of developing a whole new model. A blended approach may also be employed, in which several other models are combined with TL in seeking a solution to a problem; knowledge sharing among models yields a powerful model that generates accurate output and permits an iterative way of developing a functional model.

4. Simulated training to prepare for real-world tasks. For simulated training, TL is an important aspect of the DL model because digital simulations save both time and cost, especially when models are trained to resolve real-world problems. As simulations reflect reality, models can be adequately trained to detect the desired objects within them. Reinforcement of DL models can be effectively executed using simulations, whereby the models can be trained in any desired setting or condition. For instance, implementing a self-driving system in cars establishes simulation as an integral step: initial training in the real world may not yield the expected results, so simulation is more viable before the knowledge is transferred to reality.
-
Transfer learning strategies

Various TL techniques can be employed based on data availability, domain application, and specific tasks [226, 227] (Fig. 7). The following describes TL techniques categorized based on conventional DL algorithms:

1. Inductive TL: the target and source domains are similar, but the tasks differ. The algorithms apply the inductive bias of the source domain to enhance the target task. Depending on whether the data are classified or unclassified, the two categories of this approach are self-taught learning and multitask learning [228].

2. Unsupervised TL: similar to inductive TL, but it focuses on unsupervised tasks in the target domain. The tasks differ despite similar target and source domains, and classified data are absent in both domains [229].

3. Transductive TL: the target and source tasks are the same, but the domains differ. The source domain has much labeled data, while the target domain has none. The method is based on feature space or marginal probability [230].

The listed transfer classifications denote three TL settings. The following approaches explain the transfer that revolves around the three TL categories:

1. Instance transfer: ideally, knowledge is reused from the source domain for the target task. Although the source domain data cannot be reused directly, certain fractions may be reused together with the target data to enhance output [231].

2. Feature-representation transfer: error rates and domain divergence are minimized by using good feature representations from the source domain in the target domain. Depending on the availability of classified data, supervised or unsupervised techniques can be deployed for this type of transfer [232].

3. Parameter transfer: in this transfer type, the models share parameters or a prior distribution over hyper-parameters. Unlike multitask learning (where source and target tasks are learned concurrently), TL applies extra weighting to the loss of the target domain to enhance performance [233].

4. Relational-knowledge transfer: in this transfer type, data that are not independent and identically distributed are handled, i.e., each data point is related to other data points, e.g., social network data [234].
-
Types of deep transfer learning

At times, it is difficult to distinguish TL from multitask learning and domain adaptation, mainly because these methods attempt to resolve similar problems. TL is therefore reflective of a general concept that is applied to solve a task via task domain knowledge application.

1. Domain adaptation. Here, the marginal probability distributions of source and target domains differ, e.g., \(P(X_{s})\ne P(X_{t})\). The inherent shift in the data distributions of the target and source domains requires alterations in learning transfer. For example, a corpus of movie reviews labeled negative or positive differs from a corpus of product reviews: a classifier trained on movie reviews will see a different distribution when classifying item reviews. Therefore, domain adaptation suits the TL approach in these examples [235‐239].

2. Domain confusion. Besides highlighting the efficacy of feature-representation transfer, the DL layers that capture feature sets can enhance transfer across domains and learn domain-invariant features. It is crucial to ensure that both domain representations are similar, or nearly so, to enable effective learning. To do so, some pre-processing steps are required, as elaborated by Sun et al. in their paper [240], as well as Ganin et al. in [241]. Essentially, an additional objective is added to the source model to encourage similarity between the representations of the two domains, thus causing domain confusion.

3. Multitask learning. Here, several related tasks, including the source and target tasks, are learned simultaneously, so knowledge is shared across tasks rather than transferred from one completed task to another.

4. Zero-shot learning. An extreme DL variant, zero-shot learning requires no labeled examples of the target classes; instead, the training phase is arranged to exploit additional information so that unseen data can be understood. In the book Deep Learning, Goodfellow and co-authors discuss zero-shot learning in terms of three variables: the conventional input and output variables (x and y, respectively) and a random variable T that denotes the task. The model is trained to learn the conditional probability distribution P(y|x, T). This learning type is suitable for machine translation, where labels may be absent in the target language [244‐246].

5. One-shot learning. Since DL models need plenty of training data to learn their weights, Deep Neural Networks (DNNs) are ill-suited to learning from a single example. For example, a child exposed to one apple can identify a variety of apples, but this is not the case for DL and ML approaches. A variant of TL, one-shot learning yields output from a single training instance, making it suitable for real settings where classified data are unavailable for many classes (in classification tasks) and for conditions that require new classes to be added. In an article by Fei-Fei et al. [247], the term 'one-shot learning' was coined to describe a Bayesian framework variation representing learning for the classification of objects. Since its emergence, this approach has been enhanced and applied in DL models [248].

6. Few-shot learning. This type involves training models to recognize new objects or classes from only a few examples, typically 1 to 10 per class; the goal of few-shot learning is to enable machines to learn quickly and efficiently with limited data. One-shot learning, by contrast, is the specific case of few-shot learning in which the model is trained on only one instance per class. One-shot learning is considered more challenging than few-shot learning because the model must generalize well from a single instance, whereas few-shot learning allows a small number of examples to be used for training. The challenges of interpreting multimodal time-series data from drone and quadruped robot platforms for remote sensing and photogrammetry have been discussed [249, 250], owing to the expensive and time-consuming nature of data annotation in the training stage. The authors proposed a few-shot learning architecture based on a squeeze-and-attention structure that is computationally low-cost yet accurate enough to meet certainty measures. The proposed architecture was tested on three datasets with multiple modalities and achieved competitive results. This study demonstrated the importance of developing robust algorithms for target detection in remote sensing applications using limited training data.
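A common way to make the few-shot setting concrete is a nearest-prototype classifier, in the spirit of prototypical networks: each class is summarized by the mean of its few support embeddings, and queries are assigned to the nearest prototype. The NumPy sketch below is purely illustrative (toy 2-D "embeddings" and names chosen for this example, not code from the cited works):

```python
import numpy as np

rng = np.random.default_rng(0)

def prototypes(support, labels):
    """Mean embedding per class from a few labeled support examples."""
    classes = np.unique(labels)
    return classes, np.stack([support[labels == c].mean(axis=0) for c in classes])

def predict(queries, classes, protos):
    """Assign each query to the class of its nearest prototype."""
    d = np.linalg.norm(queries[:, None, :] - protos[None, :, :], axis=2)
    return classes[d.argmin(axis=1)]

# Two toy classes in a 2-D embedding space, 3 support examples each (3-shot).
support = np.vstack([rng.normal(0, 0.1, (3, 2)), rng.normal(1, 0.1, (3, 2))])
labels = np.array([0, 0, 0, 1, 1, 1])
classes, protos = prototypes(support, labels)

queries = np.array([[0.05, -0.02], [0.98, 1.01]])
print(predict(queries, classes, protos))  # -> [0 1]
```

With k = 1 support example per class, the same code degenerates to the one-shot case: each "prototype" is simply the single available example.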
-
Transfer learning approaches

The two TL methods are feature extraction and fine-tuning [251‐253].

1. Feature extraction. Here, a well-trained CNN model, pre-trained on a massive dataset such as ImageNet, is deployed to extract features for the target domain. The fully connected layers of the CNN are discarded and all convolution layers are frozen; the latter serve as the feature extractor. The extracted features are fed to a classifier, either a supervised ML model or new fully connected layers, which adapts to the new task. Lastly, only the new classifier, rather than the whole network, is trained [254, 255].

2. Fine-tuning. This method is similar to feature extraction, except that the convolution layers of the well-trained CNN are not frozen; their weights are updated during the training phase. Thus, the convolution layers are initialized with the CNN's pre-trained weights, while the classifier layers are initialized with random weights. Here, the whole network undergoes training [164, 256].
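The difference between the two approaches can be sketched on a toy two-layer network: feature extraction keeps the "pretrained" layer frozen and trains only the new head, while fine-tuning also updates the backbone weights (typically with a smaller learning rate). All names, shapes, and numbers below are illustrative assumptions for this sketch, not the survey's method:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float)        # toy binary task

W1 = rng.normal(size=(4, 8)) * 0.5               # stands in for pretrained weights
W2 = rng.normal(size=(8,)) * 0.1                 # new, randomly initialised head

def forward(X, W1, W2):
    H = np.tanh(X @ W1)                          # "pretrained" feature extractor
    p = 1 / (1 + np.exp(-(H @ W2)))              # new classifier head
    return H, p

def step(X, y, W1, W2, lr_head=0.5, lr_backbone=0.0):
    """One gradient step; lr_backbone=0.0 keeps W1 frozen (feature extraction)."""
    H, p = forward(X, W1, W2)
    err = p - y                                  # dLoss/dlogit for cross-entropy
    gW2 = H.T @ err / len(y)
    gH = np.outer(err, W2) * (1 - H**2)          # backprop through tanh
    gW1 = X.T @ gH / len(y)
    return W1 - lr_backbone * gW1, W2 - lr_head * gW2

for _ in range(300):                             # feature extraction: head only
    W1, W2 = step(X, y, W1, W2, lr_backbone=0.0)
_, p = forward(X, W1, W2)
acc_frozen = ((p > 0.5) == y).mean()

for _ in range(300):                             # fine-tuning: unfreeze W1 too
    W1, W2 = step(X, y, W1, W2, lr_backbone=0.05)
_, p = forward(X, W1, W2)
acc_finetuned = ((p > 0.5) == y).mean()
print(acc_frozen, acc_finetuned)
```

The same freeze-versus-update choice is what real frameworks expose when layers are marked as non-trainable before continuing training on the target dataset.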
-
Research problem in transfer learning for medical imaging

One solution to the lack of training data is employing models pre-trained on ImageNet for the target task. For some applications, this type of TL from ImageNet has significantly improved results compared with training from scratch [257, 258]. However, for other applications, such as medical imaging, TL from ImageNet does not help to address the lack of training data. This is due to the mismatch in learned features between natural images, e.g., ImageNet (color images), and medical images (gray-scale images such as MRI, CT, and X-ray) (see Fig. 8) [213, 259]. Moreover, ImageNet models were designed to classify 1000 classes, whereas medical imaging tasks typically involve between 2 and 10 classes; reusing them therefore results in unnecessarily heavy models. It has been shown that TL from a different domain (such as ImageNet) does not significantly affect performance on medical imaging tasks, with lightweight models trained from scratch performing nearly as well as standard ImageNet models [260]. To that end, Alzubaidi et al. proposed two novel types of TL that showed excellent results in several medical applications [23, 124]. One of the solutions was based on training the DL model on a large number of unlabeled images of a specific task, after which the model is trained on a small labeled dataset for that same task. This approach guarantees that the model learns the relevant features and reduces the labeling effort, and it offers the chance to use a shallow model with the desired input size. Using the same approach, several published articles have confirmed the effectiveness of these solutions for medical images and other domains [22, 123, 164, 261‐265]. A similar solution was proposed by Azizi et al. [70], who improved the learned features of DL models by training them on a large number of unlabeled images of a specific task before training on a small labeled dataset for that same task.

Figure 9 compares two models trained for the detection of shoulder abnormalities from our ongoing work. The first column shows the original images, with a red circle marking the region of interest identified by an expert. The second column shows a model trained with TL from ImageNet, while the third shows a model trained with same-domain TL for the target dataset. As shown in the first row, both models correctly predicted the image based on their confidence values. However, the heatmap reveals that the first model is biased and inaccurate, failing to detect the region of interest indicated by the red circle. In contrast, the second model accurately identified the region of interest with a high confidence value. The second row illustrates that the first model missed the classification, while the second one classified the sample correctly. This example highlights the importance of the source of TL, as even a model with correct confidence values may not be trustworthy.
-
Instances of transfer learning for deep learning

TL has been applied in many areas within the DL field and in real-world applications, e.g., enhancing computer vision and NLP. The following describes some instances of TL used in DL; further domains that have used TL to address the lack of training data are listed in Table 1.

1. Transfer learning in NLP. NLP is the capability of a system to analyze and comprehend human language (text or audio) to enhance human-system interaction. NLP is crucial for daily activities, including language contextualization tools, voice assistants, translation, speech recognition, and automated captions. Many DL models for NLP can be enhanced with TL, for example by adding pre-trained layers that identify vocabulary or dialect, or by training models concurrently to identify language aspects. TL can also be used to adapt models across languages: models trained and refined in one language may be adapted to other, similar languages. Given the vast digitized resources available in English, models may be trained on a massive English dataset before their learned aspects are transferred to another language [266‐272].

2. Transfer learning in computer vision. Computer vision is the capability of a system to extract meaning from visual formats (images or videos). DL models are trained on a massive volume of images to recognize and group them. Here, TL recycles elements of a computer vision model for application in a new model. Accurate models generated via TL from training with massive data can be applied effectively to smaller image sets, or to more general aspects such as detecting object edges. Essentially, specific model layers that detect objects or shapes can be retained, while TL sets the model's functionality by refining and optimizing the remaining parameters [273‐275].

3. Transfer learning in neural networks. The ANN is a crucial element of DL for simulating and replicating human brain functions. Notably, NN training consumes plenty of resources due to model intricacy, so TL is crucial to minimize resource use and ensure an efficient process. The development of new models includes the transfer of features or knowledge across networks, and reusing knowledge in varied settings is a vital aspect of network building. Essentially, TL is typically limited to general tasks or processes that remain relevant across an assortment of scenarios [214, 215, 276].

4. Transfer learning for audio/speech. As with computer vision and NLP, DL models can be applied to audio data. Automatic Speech Recognition (ASR) models formulated for the English language are broadly reused to enhance the performance of speech recognition in other languages. Another instance of TL application is automated speaker identification [177, 277, 278].

Table 1 Some examples of TL from the literature, covering the fields of industrial automation, medical imaging, wireless communications, plant diseases, machinery fault, software defect, activity recognition, object detection, and the Internet of Things.
-
The future of transfer learning

Widespread access to more powerful models formulated by conglomerates and related organizations dictates the future of DL models. It is crucial that DL be adaptable and accessible to organizational demands and goals in order to revolutionize processes and businesses; however, only a handful of organizations possess the resources and expertise to train models and classify data. One challenge faced by supervised DL is obtaining a massive amount of classified data: classifying countless data items is labor-intensive, and limited access to most data prevents many from developing powerful models. Organizations with access to plenty of classified data and resources can develop algorithms effectively, but when such a model is used in another organization, its performance may differ due to changes in environment and training conditions. Even the most accurate models suffer performance degradation in a different setting, which hinders DL from shifting into mainstream application. TL has a significant role in resolving this barrier: by integrating TL, DL models become more powerful through their ability to carry out specific tasks in new settings. Hence, TL is a key driver for distributing DL models across new fields and areas.
Self-supervised learning
-
Pretext tasks: these are tasks designed to generate labels for the data, which can then be utilized to train a DL model. Examples of pretext tasks include predicting the rotation of an image, predicting the next frame in a video, and predicting the mask for an image. One example of using a pretext task for SSL is the work by Doersch et al. [341]: the authors trained a CNN to predict the relative location of randomly selected patches within an image, and the CNN learned useful features from the images that could then be utilized for other tasks.
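As a concrete illustration of a pretext task, the rotation-prediction idea can be sketched in a few lines: labels are generated from the data itself by rotating each image by 0/90/180/270 degrees and recording which rotation was applied. This NumPy sketch is illustrative only, not code from the cited work:

```python
import numpy as np

rng = np.random.default_rng(0)

def rotation_batch(images):
    """Return (rotated_images, labels) where label k means k*90 degrees."""
    rotated, labels = [], []
    for img in images:
        k = rng.integers(0, 4)                 # pick one of 4 rotations
        rotated.append(np.rot90(img, k))       # self-generated "free" label
        labels.append(k)
    return np.stack(rotated), np.array(labels)

images = rng.normal(size=(8, 32, 32))          # toy grayscale batch
x, y = rotation_batch(images)
print(x.shape, y.shape)                        # (8, 32, 32) (8,)
```

A model trained to predict `y` from `x` must learn orientation-sensitive features of the image content, which is what makes the learned representation useful downstream.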
-
Autoencoders: these are neural networks trained to reconstruct their input data. They are often utilized as a way to learn useful features from the data, which can then be utilized for other tasks. An example of using autoencoders for SSL is the work by Masci et al. [342], where the authors trained a stacked autoencoder to learn features from images of faces; the learned features were then used to train a classifier to recognize the identities of the faces.
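A minimal sketch of the idea, assuming a toy linear autoencoder trained with gradient descent (illustrative only; real feature learners use deep nonlinear encoders). The 2-D bottleneck code is the learned representation that could later feed a downstream classifier:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 6))   # 6-D data on a 2-D subspace

We = rng.normal(size=(6, 2)) * 0.1     # encoder weights
Wd = rng.normal(size=(2, 6)) * 0.1     # decoder weights

def loss(X, We, Wd):
    R = X @ We @ Wd                    # encode then decode
    return ((R - X) ** 2).mean()

lr = 0.05
err0 = loss(X, We, Wd)
for _ in range(1500):
    Z = X @ We                         # bottleneck codes (the learned features)
    R = Z @ Wd                         # reconstruction
    G = 2 * (R - X) / X.size           # dLoss/dR for mean squared error
    gWd = Z.T @ G
    gWe = X.T @ (G @ Wd.T)
    We -= lr * gWe
    Wd -= lr * gWd
print(err0, loss(X, We, Wd))           # reconstruction error drops sharply
```

Because the toy data lie exactly on a 2-D subspace, a rank-2 linear autoencoder can drive the reconstruction error close to zero, which is the sense in which the bottleneck "summarizes" the input.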
-
Generative models: these models are trained to generate new data similar to the training data. Examples include Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs), which will be explained in the next section. An example of using generative models for SSL is the work by Goodfellow et al. [343]: a GAN was trained to generate synthetic images similar to a dataset of real images, and the generated images were used to train a classifier to recognize objects in real images.
-
Contrastive learning: this SSL technique involves training a model to distinguish between different types of data; the model is then fine-tuned on a downstream task using the learned features. An example of using contrastive learning for SSL is the work by He et al. [344], where a CNN was trained to distinguish between different types of images and the learned features were used to train a classifier on a downstream task.
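The core contrastive objective can be illustrated with an InfoNCE-style loss, as used in methods of this family: embeddings of matched ("positive") views should score higher than all mismatched ("negative") pairs in the batch. The sketch below uses toy embeddings and is not code from the cited work:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss over a batch; row i of positives matches row i of anchors."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                # similarity of every pair
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # matched pairs on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(16, 8))
aligned = info_nce(z, z + 0.01 * rng.normal(size=(16, 8)))   # views agree
shuffled = info_nce(z, rng.permutation(z))                    # views unrelated
print(aligned, shuffled)   # aligned views give a much lower loss
```

Minimizing this loss pulls the two views of the same sample together while pushing all other samples apart, which is what shapes the representation before downstream fine-tuning.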
-
Self-supervised multitask learning: this technique is based on training a single model on multiple tasks simultaneously, using a combination of supervised and unsupervised learning. The model learns to solve the multiple tasks using the shared features learned from the unsupervised tasks. An example of using self-supervised multitask learning is the work by Caruana et al. [345]: a single neural network was trained to perform multiple tasks simultaneously, using both supervised and unsupervised learning, and the network learned to solve the tasks using the shared features learned from the unsupervised tasks.
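A minimal sketch of the idea, assuming a toy shared linear encoder optimised jointly for a supervised head and an unsupervised reconstruction head by minimising a weighted sum of the two losses (shapes and weightings are made up for this example, not the cited setup):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] > 0).astype(float)                 # supervised task labels

Ws = rng.normal(size=(6, 4)) * 0.1              # shared encoder
Wc = rng.normal(size=(4,)) * 0.1                # supervised (classification) head
Wr = rng.normal(size=(4, 6)) * 0.1              # unsupervised (reconstruction) head

lr, alpha = 0.1, 0.5                            # alpha weights the unsupervised loss
for _ in range(1000):
    H = X @ Ws                                  # shared features used by both heads
    p = 1 / (1 + np.exp(-(H @ Wc)))             # classification head output
    R = H @ Wr                                  # reconstruction head output
    g_logit = (p - y) / len(y)                  # cross-entropy gradient
    g_R = 2 * (R - X) / X.size                  # reconstruction (MSE) gradient
    gWc = H.T @ g_logit
    gWr = H.T @ g_R
    gH = np.outer(g_logit, Wc) + alpha * (g_R @ Wr.T)   # both losses shape Ws
    Ws -= lr * (X.T @ gH)
    Wc -= lr * gWc
    Wr -= lr * alpha * gWr
acc = ((p > 0.5) == y).mean()
print(acc)
```

The key point is the shared gradient `gH`: the encoder `Ws` is updated by both objectives at once, so the learned features must serve the supervised and the unsupervised task simultaneously.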
Generative adversarial networks (GANs)
-
Variants of GAN

Enhancements made to the GAN architecture (Fig. 11) are explained in the following:

1. Fully connected GANs. The initial GAN MA used fully connected NNs for both D and G [348]. This MA was applied for the detection of simple images, e.g., the Toronto Face dataset (TFD), MNIST, and CIFAR10 (natural images).

2. Conditional GANs (CGAN). In this extension, the D and G networks are conditioned on additional data y to overcome the reliance on random variables in the original model [353], where y denotes auxiliary data such as class labels or data from other modalities. The conditioning is applied by feeding y into both the G and D networks as an extra input layer (see Fig. 12). In the G network, the prior input noise pz(z) and y are combined in a joint hidden representation, and the adversarial training framework permits considerable flexibility in how this hidden representation is composed [353]. In the D network, both x and y are presented as inputs to the discriminative function.

3. Laplacian pyramid of adversarial networks (LAPGAN). Using a cascade of convolutional networks in the LAPGAN model, Denton et al. [354] introduced image generation in a coarse-to-fine manner. The multiscale structure of natural images is thereby exploited to build GAN models, each capturing the image structure at a particular level of a Laplacian pyramid. Built from the Gaussian pyramid, the Laplacian pyramid uses two functions: downsampling d(.) and upsampling u(.). Let \(G(I) = [I_{0};I_{1}; \ldots ; I_{K}]\) be the Gaussian pyramid, where \(I_{0} = I\) and \(I_{k}\) denotes k repeated applications of d(.) to I. The Laplacian pyramid coefficient \(h_{k}\) at level k is the difference between adjacent levels of the Gaussian pyramid, upsampling the smaller one with u(.), as in Eq. (3):

$$h_{k}=L_{k}(I) = G_{k}(I)- u(G_{k+1}(I))= I_{k}-u(I_{k+1})$$(3)

The image is reconstructed from the Laplacian pyramid coefficients \([h_{1}; \ldots ; h_{K}]\) via the backward recurrence in Eq. (4):

$$I_{k}=u(I_{k+1})+h_{k}$$(4)

The convolutional generative models trained in LAPGAN capture the distribution of the coefficients \(h_{k}\) for the various Laplacian pyramid levels and, during reconstruction, yield generated coefficients \(\bar{h}_{k}\). The reconstruction in Eq. (4) is accordingly modified as in Eq. (5):

$$\bar{I}_{k}=u(\bar{I}_{k+1})+\bar{h}_{k}= u(\bar{I}_{k+1})+ G_{k}(z_{k}, u(\bar{I}_{k+1}))$$(5)

A training image I is used to construct the Laplacian pyramid, and at every level a stochastic choice is made to construct the coefficient \(h_{k}\) either via the standard procedure or via generation with \(G_{k}\). LAPGAN builds on the CGAN model by feeding the low-pass image \(l_{k}\) to both G and D. LAPGAN performance was assessed on three datasets (LSUN, CIFAR10, and STL10) through comparisons of log-likelihood, generated image sample quality, and human examination of samples.

4. Deep convolutional GAN (DCGAN). A new class of CNNs called DCGANs was introduced by Radford et al. [355] to resolve the following architectural issues noted in the CNN MA:
-
Fully connected hidden layers are discarded, while pooling layers are substituted with fractionally-strided convolutions in G and strided convolutions in D.
-
Batch normalization is applied for both G and D models.
-
ReLU activation is used in all G layers except the final layer, while LeakyReLU activation is used in all D layers.
The G in DCGAN used in LSUN sample scene modeling is portrayed in Fig. 13. Its performance was compared with that of SVHN, LSUN, CIFAR10, and Imagnet 1K datasets. First, DCGAN was used as a feature extractor to determine the quality of unsupervised representation learning, followed by the determination of accuracy performance by fitting a linear model above the features. Notably, G displayed the ability to disregard some elements of the scene, e.g., furniture and windows. Good outcomes were noted when vector arithmetic was executed on face samples.×5.Adversarial autoencoders (AAE)The AAE, which was proposed by Makhzani et al. [356], refers to a probabilistic autoencoder that applies GAN to carry out variational inference. This is done by matching arbitrary prior dissemination with aggregated posterior of hidden code vector in autoencoder. The autoencoder in AAE undergoes training with two aims—criteria for conventional reconstruction error and adversarial training. Next, conversion of the data distribution to the prior one is learned by the encoder at post-training. The decoder, on the other hand, learns the deep generative model that portrays that prior to data distribution (Fig. 14). The MA of AAE is given below: Where x and z are the input and latent code vectors of autoencoder. p(z), q(z|x), and p(x|z) reflect imposed prior, encoding, and decoding distributions, respectively. Next, pd(x) and p(x) signify data and model distributions, respectively. The aggregated posterior distribution of q(z) on hidden code vector of the autoencoder is defined as q(z|x) (autoencoder encoding function), as expressed in Eq. (6):Regularisation of autoencoder in AAE is performed by matching arbitrary prior p(z) with aggregated posterior q(z). The adversarial G network serves as an encoder for autoencoder q(z|x)). Both autoencoder and adversarial networks are jointly trained with gradient descent in reconstruction and regularisation stages. 
In the reconstruction stage, the autoencoder updates both the encoder and decoder to minimize the reconstruction error of the inputs. In the regularisation stage, the adversarial network updates D to distinguish true samples from fake ones, followed by a generative model update to confuse D. During adversarial training, AAE can also include labels to give the hidden code a better distribution shape. A one-hot vector, included in the discriminative network input to link distribution modes with labels, acts as a switch that chooses a decision boundary for the discriminative network based on the class label. The vector has an extra class for unlabelled data; when unlabelled data are encountered, this extra class lets the decision boundary be chosen for the full Gaussian distribution.

$$q(z)=\int_{x} q(z|x)\,p_{d}(x)\,dx \qquad (6)$$

6. Generative recurrent adversarial networks (GRAN)

The GRAN, introduced by Im et al. [357], has a recurrent computation, produced from gradient-based unrolled optimization, which incrementally develops images on a visual canvas (see Fig. 15). Current canvas images are extracted by a convolutional network encoder. The decoder is fed with generated and reference image codes to decide on canvas updates. Functions f and g are the GRAN decoder and encoder, respectively. The G in GRAN has a recurrent feedback loop, which receives a sequence of noise samples from the prior distribution \(z \sim p(z)\) to draw results over the time steps \(C_{1}, C_{2}, \ldots, C_{T}\). A sample z from the prior distribution is passed to function f(.) at time step t together with the hidden state \(h_{c,t}\), where \(h_{c,t}\) encodes the previous drawing \(C_{t-1}\). \(C_{t}\) denotes what is drawn on the canvas at time t with the output of function f(.). Function g(.) mimics function f(.) in inverse. Accumulating the samples of every time step produces the final sample drawn on the canvas, C. Function f(.)
is the decoder that accepts the noise sample z and the past hidden state \(h_{c,t}\) as input, while function g(.) is the encoder that provides the hidden representation of the output \(C_{t-1}\) for time step t. Unlike the other models, GRAN begins with the decoder.

7. Bidirectional GAN (BiGAN)

The BiGAN (see Fig. 16) was proposed by Donahue et al. [358] to learn the inverse mapping of the data distribution and its semantics, in which the learned feature representations are re-projected into the latent space. Referring to Fig. 9, apart from the G deriving from GAN, BiGAN has an encoder E that maps data x to the latent representation z. The BiGAN D discriminates not only in data space [x versus G(z)] but jointly in data and latent spaces [tuples (x, E(x)) versus (G(z), z)], where the latent component is either the encoder output E(x) or the G input z. Based on the GAN objectives, the BiGAN encoder E can learn to invert G.
-
GAN applications

The GAN yields realistic samples from an arbitrary latent vector z, thus dispensing with the identification of the real data distribution. Consequently, GANs have been used in many academic and engineering fields. This section presents the applications of GANs in terms of generating new data to enhance the training set [359‐361].

1. Generation of high-quality images

Recent studies on GAN have enhanced both the usability and quality of image production, such as the LAPGAN model [354] discussed before. Several publications have addressed the issue of lack of training data using GANs [350, 362‐364]. The Self-Attention GAN (SAGAN) was initiated by Zhang et al. [365] to enable long-range, attention-driven dependency modeling for image generation. This is dissimilar from convolutional GANs, which generate high-resolution details from spatially local points within low-resolution feature maps. SAGAN, which adds cues-generating details from all feature areas, yields excellent outcomes: it lowered the Frechet Inception Distance (FID) to 18.65 from 27.62 and raised the Inception Score (IS) to 52.52 from 36.8 on the ImageNet dataset. BigGAN was introduced by Brock et al. [366] to yield diverse and high-resolution samples from intricate datasets (ImageNet) by training GAN at the largest scale. Orthogonal regularisation was applied to G to enable a 'truncation trick' that controls the trade-off between sample variety and fidelity by reducing the variance of the G input. A further alteration enabled the model to synthesize class-conditional images. The model, upon being trained on ImageNet (resolution: 128 \(\times\) 128), scored 166.5 and 7.4 for IS and FID, respectively, which was better than the model described above. A G network for GAN was initiated in light of style transfer [367, 368].
The model displayed several noteworthy outcomes: it enabled scale-specific and intuitive control of synthesis, automatic learning, stochastic variation in the produced images (e.g., hair & freckles), and unsupervised separation of high-level attributes (identity & pose when trained on human faces). Meanwhile, Huang et al. [369] introduced GANs that operated on intermediate representations rather than low-resolution images. This model is similar to LAPGAN with an extended CGAN, as the D and G networks could accept extra labeled data as input, a popular method to date that enhances image quality. In another instance, Reed et al. [370] applied GAN for image synthesis from texts (reverse captioning). To illustrate, a trained GAN may produce images that match a certain description, such as the following text: white with some black on its head and wings and a long orange beak. Along with texts, image location can be conditioned using a Generative Adversarial What-Where Network (GAWWN) that incrementally builds big images with the support of an interactive interface and bounding boxes supplied by the user [371]. As for CGAN, besides synthesizing new samples with certain features, it permits users to create tools to edit images [372]. To maximize the activation of one or many neurons in a separate classifier network, Nguyen et al. [373] introduced a novel approach that synthesizes new images via gradient ascent in the latent space of the G network. The extension of this method incorporated an extra prior on the latent code, which enhanced sample diversity and quality, yielding high-quality images (resolution: 227 \(\times\) 227) for all ImageNet classes [374]. Additionally, Plug and Play Generative Networks (PPGNs) were introduced, possessing (1) a G network that draws multiple image types and (2) a replaceable condition network that tells G what to draw.
As a result, the images were conditioned on the caption (C = image captioning network) and class (C = ImageNet/MIT Places classification network). Next, the GAN model was used by Salimans et al. [375] to execute training with novel features based on two aspects: semi-supervised learning and the production of visually realistic human images. This model yielded accurate outputs using semi-supervised classification on SVHN, MNIST, and CIFAR10. Based on a Turing test, the produced images were verified to be of high quality. While the CIFAR10 samples displayed a 21.3% human error rate, those of MNIST were near-identical to real data. Wasserstein GAN (WGAN) was used by Huang et al. [376] for density reconstruction in dynamic topography. WGAN was proposed by Arjovsky et al. [377] to enable stable training, yet it can still fail to converge and produce poor samples. These issues, according to Gulrajani et al. [378], were due to the weight clipping used to enforce the Lipschitz constraint on the critic. As an alternative to weight clipping, they penalized the norm of the critic's gradient with respect to its input. This resulted in better training for multiple GAN MAs with nearly no hyperparameter tuning, including language models with a continuous G and 101-layer ResNets, as well as high-quality yields on LSUN and CIFAR10. Based on the above, we believe GAN is an effective solution to generate more data to address both lack of data and imbalanced data [359‐361, 379, 380].

2. Image inpainting

Missing-parts reconstruction in images, or image inpainting, makes the reconstructed areas undetectable. Hence, damaged areas are restored and undesired objects are discarded in images. GANs have been applied to address this issue [381‐384]. Recent DL approaches can recover missing parts in images via the image inpainting technique, thus yielding convincing image textures and structures.
Inferring arbitrarily large missing image parts via image semantics is called 'semantic inpainting' [385, 386]. The demand for high-level context prediction makes this method more difficult than image completion or earlier inpainting methods that remove whole objects and address inauthentic data corruption. A method based on a deep generative model was initiated by Yu et al. [387] to exploit surrounding image characteristics and synthesize image structures for better prediction. This feed-forward CNN model processes images of varied sizes with multiple holes at random locations during the testing phase. Experimental work involving natural images (Places2 & ImageNet), textures (DTD), and face samples (CelebA & CelebA-HQ) revealed that the introduced model yielded higher-quality inpainting outcomes. Another study introduced an inpainting system in the DL model to complete images using inputs and free-form masks [388]. Using gated convolutions, the system learned from millions of unlabelled images to address the problem of vanilla convolutions (which treat all input pixels as valid) by generalizing partial convolution and offering a mechanism to learn dynamic features for each channel at every spatial location across all layers. A GAN loss model (SN-PatchGAN), using a D with spectral normalization on patches of dense images [388], is fast, simple, and offers stable training. The extended version with automatic image inpainting produced more flexible and higher-quality yields. Using an edge G followed by an image completion system, Nazeri et al. [389] built a model with a two-stage adversary. Missing region edges in images are hallucinated by the edge G, and these edges are then filled in by the image completion system, which uses them as a prior. The model was assessed using the Paris Street View, CelebA, and Places2 datasets. A new semantic image inpainting model was proposed by Yeh et al. [390] based on a GAN MA, whereby semantic inpainting was viewed as an image generation problem.
Their adversarial model [391, 392] was trained to seek the encoding of the corrupted image 'closest' to the target image in latent space. Next, the image is reconstructed using G via this encoding. 'Closest' refers to the weighted context loss on the corrupted image, with unrealistic images penalized via a prior loss. In comparison to CE, this approach dispenses with masks for training and can be applied to randomly structured missing areas at the inference phase. The technique was assessed with the CUB-Birds [393], CelebA [394], and SVHN [395] datasets with varied missing areas. The method gave more realistic images than other approaches.

3. Super-resolution

Upscaling images or videos requires super-resolution, as it upgrades low-resolution images to high resolution by incorporating realistic image details learned at the training phase [396‐398]. For instance, a new training approach was initiated by Karras et al. [399] to progressively grow G and D: training begins at low resolution, and new layers are increasingly included to model fine details during training. This approach offers better speed and stability while training, thus generating high-quality images using CelebA. An extension of prior models, the SRGAN approach [400], is embedded with an adversarial loss element that constrains images to stay on the manifold of natural images. Specifically, the G in SRGAN takes low-resolution images and infers natural realistic images with a four-times scaling factor. The adversarial loss, unlike in other GAN models, is one component of a larger loss function that incorporates a perceptual loss from a pre-trained classifier, as well as a regularisation loss that yields spatially coherent images. The entire solution is constrained by the adversarial loss to the manifold of natural images, thus generating better solutions. Access to curated training data is a hindrance to DL model customization.
Nonetheless, SRGAN can be customized to specific domains in a straightforward manner because new training image pairs are easily constructed by down-sampling a corpus of high-resolution images. Essentially, the image domain in the training set dictates the realistic details produced by the GAN. To improve SRGAN visual quality, Wang et al. [401] revisited three of its elements: perceptual loss, network architecture, and adversarial loss, resulting in the Enhanced SRGAN (ESRGAN). The fundamental network building unit is the Residual-in-Residual Dense Block (RRDB), without batch normalization. The idea derives from the relativistic GAN, which lets D predict relative realness rather than an absolute value. To gain stronger supervision for texture recovery and brightness consistency, the perceptual loss was enhanced by using features before activation. ESRGAN gave higher visual quality with more natural and realistic texture than SRGAN, and it won region 3 of the PIRM2018-SR Challenge with the best perceptual index. As many techniques end up yielding low-quality, low-resolution images in real scenarios, Bulat et al. [402] introduced a two-stage process: (1) a High-to-Low GAN is trained to learn how high-resolution images are down-sampled and degraded, and (2) its output is applied to train a Low-to-High GAN to generate super-resolution images.

4. Video prediction and generation

An issue in computer vision is comprehending scene dynamics and object motion. A model is needed for scene transformation in video generation (prediction of the future) and recognition (grouping of actions). Building this model is, however, not easy due to the motion of scenes and objects [403, 404]. A GAN for video was proposed by Vondrick et al. [405] to untangle the scene foreground from the background via a spatiotemporal convolutional architecture.
In predicting the future of static images, the proposed model could produce a one-second video at full frame rate, which is better than a simple baseline. Further assessment revealed that the model could learn features to recognize actions with minimal supervision; scene dynamics are thus viable for representation learning. Several works were proposed for the same purpose using GANs [404, 406, 407]. The Motion and Content decomposed GAN (MoCoGAN) was introduced by Tulyakov et al. [408] to generate videos. Videos are made by mapping a sequence of random vectors, each with a fixed content part and a stochastic motion part, to a sequence of video frames. Using video and image Ds, a new adversarial learning mechanism was devised to learn the decomposition of content and motion in an unsupervised manner. The model's efficacy was verified empirically via quantitative and qualitative approaches, and this approach has since been improved in different ways [360, 404, 407].

5. Anime character generation

Apart from requiring experts for routine tasks, animation production and game development are costly. Anime characters can be colorized and auto-generated using GANs [409‐413]. These G and D networks have multiple ReLU layers with skip connections, convolutional layers, and batch normalization. CartoonGAN, a solution that transforms real-world photos into cartoons, was initiated by Chen et al. [414] for computer graphics and computer vision applications. The easy training phase involves cartoon images and unpaired photos. The two losses for cartoon styling are (1) a semantic content loss (sparse regularisation on high-level feature maps of the VGG network to cope with photo-cartoon style variation) and (2) an edge-promoting adversarial loss (preserves clear edges). To automatically generate anime characters, Jin et al. [411] combined GAN training methods and a clean dataset to yield realistic facial images. SRResNet was modified into a G model (see Fig. 17) that applies 3 sub-pixel CNN layers (to upscale the feature map) and has 16 Res-Blocks.
The architecture of D, displayed in Fig. 17, has 10 Res-Blocks. Because correlations within a mini-batch lead to unwanted gradient norm calculations, batch normalization layers were discarded from D. Additional fully connected layers were added after the final convolution layer as the attribute classifier. Weights were initialized from a Gaussian distribution with a standard deviation of 0.02 and a mean of 0. Figure 18 portrays an anime character generated by GAN.

6. Image-to-image translation

The translation of input to output images can be performed using CGAN, a recurring theme in computer vision, computer graphics, and image processing. The pix2pix model resolves such image-related issues [415‐417]. Additionally, a loss function may be devised using the pix2pix model in order to train input-to-output image mapping. It yields exceptional outcomes for varied computer vision problems, including black-and-white image colorization, semantic segmentation, and obtaining maps from aerial photos [415]. The model was extended to produce CycleGAN [418] by embedding a cycle consistency loss that preserves the original image after a translation and reverse-translation cycle. As paired images are eliminated from the training phase, the data preparation process becomes simpler and open to multiple other approaches. Artistic style transfer [419], for example, renders a natural image in the style of Monet or Picasso by training on natural images and unpaired paintings. Novel samples that match the training set can be achieved by GAN, along with style transfer (modifies the visual style of an image), domain adaptation (generalizing to new domains with unlabelled data in the target domain), and, most recently, TL (importing existing knowledge to simplify learning) approaches [420]. Nonetheless, the general analogy synthesis problem remains untapped. Hence, Taigman et al.
[420] overcame this problem by separating labeled samples from domains T and S, as well as by incorporating a multivariate function f for mapping \(G: S \rightarrow T\) such that \(f(x) \sim f(G(x))\). DNNs of a certain structure were applied, where G denotes the composition of the learned function g and the input function f. A compound loss that integrates multiple terms was deployed as well. The proposed technique can map between visual domains (face images and digits) and generate realistic new images from unseen samples, while concurrently retaining identities. A generative network was split into two by Chen et al. [421] so that each part handles one subtask alone. The attention network estimates spatial attention maps of images, while the transformation network translates objects. The attention map produced in the initial step is sparse, to place more attention on the target object, and should remain constant regardless of the transfiguration. Extra guidance is given while learning the attention network via image segmentation. The outcomes revealed the importance of assessing attention during the transfiguration, whereby the introduced algorithm can learn precise attention to enhance the quality of the produced images. In the Multimodal Unsupervised Image-to-image Translation (MUNIT) model introduced by Huang et al. [422], image representation is decomposed into a content code (domain-invariant) and a style code (which captures domain-specific attributes). The translation of an image to another domain involves recombining its content code with a random style code drawn from the target domain. Upon comparison with other current models, the proposed model displayed more benefits. The Exemplar Guided and Semantically Consistent Image-to-image Translation (EGSC-IT) network introduced by Ma et al. [423] can be applied to perform the translation process on samples in the target domain.
An image consists of a shared content aspect (shared across domains) and a style aspect (specific to the domain). Adaptive Instance Normalisation applies the shared content aspect to enable style information transfer from the target domain to the source domain. A feature-level concept was deployed to prevent semantic inconsistency during translation (due to large intra- and cross-domain variations) and to offer coarse semantic guidance in the absence of semantic labels. SingleGAN was introduced by Yu et al. [424] to execute multi-domain image-to-image translation with a single G. A domain code was deployed to integrate multiple optimization goals and to control the varied generative tasks. The results on unlabelled data revealed superior performance by the proposed model when translating between two domains. CycleGAN has been used in several applications, such as medical imaging and plant diseases, to address the issue of imbalanced datasets [425‐428]. Figure 19 shows an example of CycleGAN with CT images.

7. Text-to-image translation

One of the impressive applications of GANs is text-to-image translation [430‐433]. Using GAN, Fedus et al. [434] enhanced sample quality by explicitly training G to yield high-quality samples. Their actor-critic CGAN can complete missing text conditioned on the context. Evidently, this gave more realistic unconditional and conditional text samples, quantitatively and qualitatively, in comparison to a maximum-likelihood trained model. With the benefits of automatic synthesis of realistic images from text, Denton et al. [354] applied the Laplacian pyramid with adversarial G and D to synthesize images at many resolutions. High-resolution images that can be conditioned on class labels were produced with control. Using a standard convolutional decoder, Radford et al.
[355] built a stable and effective MA by including batch normalization to attain exceptional image synthesis outcomes. The GAWWN was used by Reed et al. [370] to synthesize images from text descriptions (reverse captioning). Besides conditioning on image location [371], the model supports an interactive interface that incrementally builds up big images with textual descriptions and bounding boxes supplied by the user. As for CGANs, besides synthesizing new samples with certain features, they enable the development of tools to intuitively edit images, such as editing hairstyles or giving a younger look in images [435]. Figure 20 shows an example of text-to-image translation.

8. Face aging

Progression and regression of face age (or face aging and rejuvenation) render face images regardless of aging effects, while simultaneously preserving personalized face features (i.e., personality) [437‐440]. A conditional AAE (CAAE) was initiated by Zhang et al. [441] to learn the face manifold. Control of the age attribute provides the flexibility to achieve regression and progression concurrently. Some advantages of CAAE are: (1) it achieves age regression and progression to produce realistic face images; (2) it dispenses with paired samples during training and labeled faces during testing, ensuring model generality and flexibility; (3) disentangled personality and age in the latent vector space preserve personality and hinder ghosting artifacts; and (4) it is robust against occlusion, pose, and expression variations, as CAAE imposes a D on both the encoder and G. The D on the encoder and on G offers a smooth transition in latent space and realistic face images, respectively. Thus, CAAE yields images of higher quality than AAE. CAAE was assessed with the CACD [442] and Morph [443] datasets. A synthetic aging method was initiated by Antipov et al.
[444] for human faces using Age Conditional GAN (Age-cGAN), comprising two steps: (1) input face reconstruction, which demands solving an optimization problem to seek the optimal latent approximation, and (2) face aging, executed via a simple change of condition at the G input. This approach introduces 'Identity-Preserving' latent vector optimization that preserves the original identity during the reconstruction phase while modifying other facial features. Figure 21 shows an example of face aging.

9. Image blending

Mixing two images is called 'image blending', where the output image combines the pixel values of the input images; GANs have shown excellent performance here [445]. The dense image matching method was initiated by Gracias et al. [446] to enable copying and pasting only the related pixels. Significant differences between source images preclude the use of this model. One alternative is to make a smooth transition to hide artifacts in composited images. The Gaussian–Poisson GAN (GP-GAN), introduced by Wu et al. [447], combines the strengths of GANs and approaches based on classical gradients; it was the initial study to assess GAN ability in the high-resolution image blending task. The Gaussian–Poisson equation was developed to address the high-resolution image blending issue as a joint optimization constrained by color and gradient data. Color data are obtained from a Blending GAN, introduced to learn the mapping between well-blended and composited images, while gradient data are generated from gradient filters. Apart from producing realistic and high-resolution images, the proposed model generated fewer undesired artifacts and less bleeding. The experimental outcomes verified the superior performance of the proposed model over other models using the Transient Attributes dataset.
Model architecture
-
Mean Squared Error (MSE) is a popular loss function used in DL for regression problems. It measures the average squared difference between predicted and actual values [453].
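To make the definition concrete, here is a minimal NumPy sketch (the function name is illustrative, not from [453]):

```python
import numpy as np

def mse_loss(y_pred, y_true):
    """Mean squared error: the average of the squared residuals."""
    y_pred, y_true = np.asarray(y_pred, float), np.asarray(y_true, float)
    return np.mean((y_pred - y_true) ** 2)

# Residuals for ([2.5, 0.0, 2.0], [3.0, -0.5, 2.0]) are [-0.5, 0.5, 0.0],
# so the loss is (0.25 + 0.25 + 0.0) / 3.
loss = mse_loss([2.5, 0.0, 2.0], [3.0, -0.5, 2.0])
```

Because the residuals are squared, a few large errors dominate the average, which is why MSE is sensitive to outliers.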
-
Mean Absolute Error (MAE) measures the average absolute difference between predicted and actual values [454]. This function is also used for regression problems.
-
Cross-Entropy Loss is commonly used for multi-class classification problems. It measures the dissimilarity between the predicted probability distribution and the actual probability distribution of the target variable [455]. It is commonly used in tasks such as image classification and natural language processing.
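As a hedged sketch of the idea (names and the log-sum-exp formulation are ours, not from [455]), cross-entropy for one sample can be computed directly from raw logits:

```python
import numpy as np

def cross_entropy_loss(logits, target_index):
    """Multi-class cross-entropy from raw logits for a single sample.
    The log-sum-exp shift keeps the computation numerically stable."""
    logits = np.asarray(logits, float)
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[target_index]

# The loss is small when the target class has the largest logit
# and grows as probability mass moves to the wrong classes.
loss = cross_entropy_loss([2.0, 1.0, 0.1], target_index=0)
```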
-
Hinge Loss is commonly used for binary classification problems, most notably in support vector machines (SVMs). It encourages correct classification by penalizing incorrect predictions linearly [456].
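The linear penalty can be sketched in a few lines of NumPy (labels encoded as ±1, as in the SVM convention; the function name is ours):

```python
import numpy as np

def hinge_loss(scores, labels):
    """Binary hinge loss: labels in {-1, +1}, scores are raw model outputs.
    Correct predictions with margin >= 1 incur zero loss;
    margin violations are penalized linearly."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, float)
    return np.mean(np.maximum(0.0, 1.0 - labels * scores))
```

Note that a confidently correct prediction (margin above 1) contributes nothing, so the loss focuses training on points near or beyond the decision boundary.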
-
Focal Loss is well suited to imbalanced classification problems. It is designed to give more weight to hard-to-classify examples, reducing the impact of easy-to-classify examples and improving performance on the minority class. It is commonly used in object detection and segmentation tasks [457].
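A minimal sketch of the binary form (following the Lin et al. formulation commonly cited for focal loss; the default gamma and alpha values here are conventional choices, not taken from [457]):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss sketch.
    p: predicted probability of class 1; y: true label in {0, 1}.
    The (1 - p_t)**gamma factor down-weights easy, well-classified
    examples, so hard examples dominate the gradient."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t))
```

With gamma = 0 and alpha = 1 this reduces to ordinary binary cross-entropy, which is a useful sanity check.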
-
Triplet Loss is used for learning representations in siamese networks or other similar architectures. It encourages the anchor sample to be closer to the positive sample than to the negative sample by at least a margin [458].
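A minimal sketch for a single triplet, using squared Euclidean distances (the margin default is illustrative):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss: pull the anchor toward the positive sample and
    push it away from the negative sample until the margin holds."""
    a, p, n = (np.asarray(v, float) for v in (anchor, positive, negative))
    d_pos = np.sum((a - p) ** 2)  # anchor-positive distance
    d_neg = np.sum((a - n) ** 2)  # anchor-negative distance
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so well-separated triplets stop contributing to training.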
-
Contrastive Loss is used to learn the similarity between two inputs, and it penalizes the model for dissimilar inputs and rewards the model for similar inputs [459].
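A minimal pairwise sketch (in the Hadsell-style formulation with a margin; naming and defaults are ours, not from [459]):

```python
import numpy as np

def contrastive_loss(x1, x2, similar, margin=1.0):
    """Pairwise contrastive loss sketch: similar pairs are pulled
    together; dissimilar pairs are pushed apart up to the margin."""
    d = np.linalg.norm(np.asarray(x1, float) - np.asarray(x2, float))
    if similar:
        return d ** 2               # penalize any distance between similar pairs
    return max(0.0, margin - d) ** 2  # penalize dissimilar pairs only inside the margin
```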
-
Sparsemax Loss is the loss associated with the sparsemax activation function, a sparse alternative to softmax that can be used in classification tasks [460]. It encourages the model to assign zero probability to irrelevant classes.
-
Kullback–Leibler (KL) Divergence Loss is used for measuring the difference between two probability distributions [453]. It is often used in generative models, such as Variational Autoencoders (VAEs).
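For discrete distributions, the divergence is a single sum (the zero-probability mask is our convention for handling p(i) = 0 terms):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions.
    Asymmetric, non-negative, and zero only when p == q.
    Terms with p(i) == 0 contribute nothing by convention."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))
```

In a VAE, a term of this form penalizes the divergence between the encoder's latent distribution and the chosen prior.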
-
Huber Loss is used in regression tasks and provides a combination of Mean Absolute Error (MAE) and Mean Squared Error (MSE) loss functions [461].
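The combination is piecewise: quadratic (MSE-like) for small residuals and linear (MAE-like) beyond a threshold delta. A minimal sketch (the delta default is the common convention):

```python
import numpy as np

def huber_loss(y_pred, y_true, delta=1.0):
    """Huber loss: quadratic for |residual| <= delta, linear beyond,
    which makes it less sensitive to outliers than MSE."""
    r = np.asarray(y_pred, float) - np.asarray(y_true, float)
    quadratic = 0.5 * r ** 2
    linear = delta * (np.abs(r) - 0.5 * delta)
    return np.mean(np.where(np.abs(r) <= delta, quadratic, linear))
```

The 0.5 * delta offset in the linear branch makes the two pieces join continuously at |r| = delta.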
-
Quantile Loss is known for quantile regression problems. It measures the difference between the predicted quantile and the actual value at that quantile, with a different loss function for each quantile. It is commonly used in financial forecasting and risk analysis [462].
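The asymmetric ("pinball") penalty can be sketched as follows (the q default is illustrative):

```python
import numpy as np

def quantile_loss(y_pred, y_true, q=0.9):
    """Pinball (quantile) loss: under-predictions are weighted by q and
    over-predictions by (1 - q), so minimizing it targets the q-th quantile."""
    e = np.asarray(y_true, float) - np.asarray(y_pred, float)
    return np.mean(np.maximum(q * e, (q - 1.0) * e))
```

With q = 0.9, being two units too low costs 1.8 while being two units too high costs only 0.2, which pushes the fitted value toward the 90th percentile, a useful property in risk analysis.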
-
Center Loss is used for face recognition tasks and minimizes the distance between the features extracted by the DL model and their corresponding class centers [463].
-
Wing Loss is designed to be robust to outliers by penalizing large errors less heavily than Mean Squared Error (MSE) Loss [464]. It is commonly used in tasks such as facial landmark detection and human pose estimation.
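A hedged sketch of the piecewise form (following the Feng et al. wing-loss formulation; the w and eps defaults are common choices, not taken from [464]): small errors are amplified logarithmically, while large errors grow only linearly.

```python
import numpy as np

def wing_loss(y_pred, y_true, w=10.0, eps=2.0):
    """Wing loss sketch: logarithmic for |error| < w, linear beyond.
    The constant c makes the two branches join continuously at |error| = w."""
    x = np.abs(np.asarray(y_pred, float) - np.asarray(y_true, float))
    c = w - w * np.log(1.0 + w / eps)
    return np.mean(np.where(x < w, w * np.log(1.0 + x / eps), x - c))
```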
-
Cosine Loss is used to optimize the cosine similarity between two feature vectors in a high-dimensional space. It is commonly used in tasks such as face recognition and image retrieval [465].
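A minimal sketch, written as one minus the cosine similarity so that identical directions give zero loss (the function name is ours):

```python
import numpy as np

def cosine_loss(u, v):
    """Cosine loss: 1 minus the cosine similarity of two feature vectors.
    Zero when the vectors point in the same direction, regardless of magnitude."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
```

Because only the angle matters, two face embeddings of different magnitudes but the same direction are treated as a perfect match, which is why this loss suits retrieval and recognition.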
-
In imbalanced datasets, one class has significantly fewer samples than the others. It can be challenging to find a loss function that balances the trade-off between correctly identifying the minority class and not misclassifying the majority class too often.
-
Noisy data can be a challenge when selecting an appropriate loss function. Noisy data can cause the model to learn incorrect patterns, leading to poor performance.
-
Overfitting is an issue because some loss functions are prone to it, especially when the model is too complex or when the data is scarce. Overfitting occurs when the model learns to fit the training data too well, resulting in poor performance on the test data.
-
Optimization challenges arise because some loss functions are difficult to optimize. This can lead to slow convergence or getting stuck in local minima.
-
Model interpretability can be an issue because some loss functions are more difficult to interpret than others, making it harder to understand why the model makes certain predictions.
Physics-informed neural network (PINN)
Deep synthetic minority oversampling technique (DeepSMOTE)
Pre-training and testing tips of using dataset
Applications
Electromagnetic imaging (EMI)
Civil structural health monitoring
Meteorology applications
Medical imaging applications
Wireless communications
Fluid mechanics
Microelectromechanical systems (MEMS)
Cybersecurity: vulnerabilities
-
Vulnerability types: software security, especially the vulnerabilities found in software implementations, is a challenging problem because numerous types of security vulnerabilities are reported and discovered every year according to the Common Weakness Enumeration (CWE) [605] and Common Vulnerabilities and Exposures (CVE) [606] databases. Most existing efforts focus on binary classification to detect only a particular type of vulnerability. A model that is trained on, for example, buffer overflows (CWE121 and CWE122) will not be able to detect other types of vulnerabilities such as SQL injections (CWE89 [607]). Therefore, it is desirable to develop more robust multi-class classification approaches that can be trained on a dataset with multiple types of vulnerabilities. The vulnerability types in the dataset are essential for detecting various vulnerabilities, and each dataset should state how many CWEs or CVEs it contains. For instance, there is one CWE in [608], 609 CVEs in [609], and 911 CWEs in [610].
-
Dataset size: the performance of a DL model depends largely on the size of the training dataset. More training data provides a larger number of samples from which the model can learn. It is a well-known problem that only a small set of labeled data is currently available to train a vulnerability detection model [611]. As a result, the limited number of existing datasets for software security are typically handcrafted test programs that are small and imprecisely labeled. In the future, it will be useful to explore techniques to automatically generate large datasets by either labeling real-world software that exhibits security vulnerabilities or synthesizing datasets that fully capture the vulnerability patterns in real-world programs. In general, the test results on large datasets will be more accurate. For instance, there are 1,274,366 samples in [612] but only 871 samples in [613].
-
Label vulnerabilities: supervised learning is one of the most common DL approaches that has been used in software vulnerability detection. It can perform well with datasets that are properly labeled before the model’s training. Unfortunately, most software security datasets are either unlabelled or imprecisely labeled. These imprecisely labeled datasets can lead to low performance and unreliable vulnerability detection models. Handcrafting labels is not only tedious and labor-intensive but also inconsistent. Many vulnerabilities are not localized and can be caused by multiple parts of the program. It is very challenging to identify the root cause of a vulnerability and manually label it in a consistent way that does not confuse machine learners. For instance, it is necessary to ensure that the labeling of all vulnerabilities of the same type follows the same rule. Therefore, to overcome this problem, researchers can consider building tools to aid the labeling process so that a large set of labeled data can be generated automatically from existing reported vulnerable software. Some datasets are labeled for each CWE or CVE (e.g., SARD [614]), but others are labeled as binary detections only, as vulnerable or not vulnerable (e.g., OSS [615]). Researchers often want ready-made labeled datasets for training due to the cost and expertise associated with manual labeling. This leads to fewer available datasets, the lack of which contributes to the problem referred to above.
-
Synthesise datasets: while many software vulnerabilities are reported each year (e.g., in CWE or CVE), they may not be sufficient to train reliable detection models. This may be because, despite the large number of different vulnerability types, there are limited cases of each type. More generally, compared with the size of the software, vulnerabilities are rare and often outliers that do not conform to the usual software behaviors. Synthetic datasets are widely used in software vulnerability detection to artificially increase the number of samples that contain vulnerabilities. For example, the Juliet project [614] generated synthetic datasets based on a few predefined patterns. However, synthetic datasets often fail to reflect the structure of real-world vulnerabilities and therefore cannot represent the diverse behaviors observed in real-world programs [616]. It is better to train the model on a mixed source-code dataset (real and synthetic), but such datasets are rarely available. For example, Java has 1,772 real samples [615] and 28,881 synthetic samples [614], PHP has 2,942 real samples [617], and SQL has 6,586 real samples [608]. Some datasets cover several programming languages, such as Python and C/C++ (8,027 real samples [618]) and Java, C/C++, C#, and PHP (177,184 synthetic and real samples [610]). Several datasets are available for C/C++ in [602]. Even so, the number of samples remains too small for DL models to generalize well. In the future, more sophisticated program synthesis techniques could be explored to increase the quality and versatility of the samples when generating large synthetic datasets.
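Building such a mixed training pool can be sketched as sampling from the two sources in a controlled ratio; the function and its parameters below are illustrative assumptions, and the samples are placeholders rather than real vulnerability data:

```python
import random

def build_mixed_dataset(real, synthetic, real_fraction=0.5, total=None, seed=42):
    """Draw a mixed training set so the model sees both real-world and
    synthetic vulnerability samples in a controlled ratio."""
    rng = random.Random(seed)
    if total is None:
        total = len(real) + len(synthetic)
    # Cap each share by what the source actually contains.
    n_real = min(len(real), int(total * real_fraction))
    n_syn = min(len(synthetic), total - n_real)
    mixed = rng.sample(real, n_real) + rng.sample(synthetic, n_syn)
    rng.shuffle(mixed)  # avoid ordering bias between the two sources
    return mixed
```

The fixed seed keeps the mixture reproducible, which matters when comparing models trained on the same real/synthetic ratio.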
-
Generalisation: when a DL model is trained on an old dataset, it may not detect the latest vulnerabilities; a dataset updated with newly reported vulnerabilities increases the accuracy of test results. Each dataset covers several Common Weakness Enumerations (CWE) [605] or Common Vulnerabilities and Exposures (CVE) [606], with further vulnerabilities being detected daily. A model trained on a subset of CWE or CVE entries cannot detect vulnerabilities outside that subset, so the dataset should be diverse and kept up to date with new vulnerabilities.
-
Transfer learning (TL): as defined previously, a learned model can be reused in other DL tasks to improve their performance [23, 124]. This approach can reduce the time and resources needed to train a DL model for different tasks and problems. It is desirable in software vulnerability detection because researchers can reuse vulnerability detection models across various software projects [611]. Unfortunately, vulnerabilities found in software implementations are typically language-specific and domain-specific (some may even be application-specific). Models trained on security vulnerabilities can differ vastly across programming languages and application domains. It is therefore hard to generalize and reuse learned models, making it challenging to transfer the learned knowledge. Currently, a separate detection model can be used for each language and vulnerability type. In the future, a generalized vulnerability detection model that is robust and efficient across different software projects is worth investigating [611].
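The core TL operation, reusing the learned feature-extraction layers while retraining the task-specific head, can be sketched as follows. Models are represented here as plain dicts of layer names to weight lists, purely for illustration; real frameworks expose the same idea through parameter copying and layer freezing:

```python
def transfer_weights(source_model, target_model, shared_layers):
    """Copy learned parameters for the shared feature-extraction layers
    from a source model into a target model; the remaining layers keep
    their (fresh) values and are retrained on the new task."""
    for name in shared_layers:
        # Copy rather than alias, so later fine-tuning of the target
        # does not silently modify the source model.
        target_model[name] = list(source_model[name])
    return target_model
```

In a framework setting, the copied layers would typically also be frozen so that the small labeled set for the new task only updates the head.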
Tips for reporting the dataset
-
whether the dataset used is public or private. If it is public, the source of the dataset must be cited, including articles and links. If it is private, the collection process must be described.
-
the criteria for selecting the dataset(s) and whether the dataset(s) test the hypothesis.
-
the details of the dataset(s), including the type of data, the number and names of classes, the size of samples, the total number of samples, the number of samples per class, and the resolution. Figures are important to show samples of the dataset with the label of each class.
-
whether the dataset used is real or simulated. In the case of simulated data, the simulation process must be explained.
-
the labeling process for a private dataset and whether it was performed by an expert or in an automated way.
-
the pre-processing stage and the data features that were manipulated.
-
changes to the data after each step when multiple pre-processing procedures were applied.
-
the data augmentation techniques (if used) with figures showing a sample of each technique used.
-
the ratios of training, validation, and testing sets. The rationale for choosing these ratios and ensuring these sets were unbiased regarding data characteristics must also be described.
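Reporting split ratios is most meaningful when the splits are stratified, i.e., each class keeps roughly the same proportion in all three sets. A minimal sketch, assuming (sample, label) pairs and illustrative 70/15/15 ratios:

```python
import random
from collections import defaultdict

def stratified_split(samples, ratios=(0.7, 0.15, 0.15), seed=0):
    """Split (x, label) pairs into train/val/test while keeping each
    class's proportion roughly equal across the three sets."""
    by_class = defaultdict(list)
    for x, y in samples:
        by_class[y].append((x, y))
    rng = random.Random(seed)  # fixed seed for a reproducible, reportable split
    train, val, test = [], [], []
    for items in by_class.values():
        rng.shuffle(items)
        n = len(items)
        n_train = int(n * ratios[0])
        n_val = int(n * ratios[1])
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]
    return train, val, test
```

Shuffling within each class before slicing avoids ordering bias, and splitting per class guarantees no class is accidentally missing from the test set.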
-
comparisons with other methods. The same dataset with the same ratios of validation and testing sets must be used to ensure the comparison with other methods is valid.
-
the description of the dataset when it is uploaded to one of the public repositories.
Trustworthy training datasets
-
Quality of data: the data in the dataset should be accurate and relevant to the problem at hand.
-
Annotation quality: the annotations should be accurate and consistent if the annotation is needed.
-
Diversity: the dataset should be diverse and include a wide range of samples to ensure that the model learns to generalize to new scenarios.
-
Size: the size of the dataset can be a factor in its trustworthiness. A larger dataset can help the model learn more robust and generalizable features, but it is critical to ensure the data is of high quality and diverse.
-
Source: the source of the data is important, as it should be from a trustworthy organization or individual.
-
Preprocessing: the data should be cleaned and preprocessed appropriately in order to be usable for training a DL model.
-
Balance: if the dataset is used for classification tasks, it should be balanced, meaning that it should include a roughly equal number of examples for each class. Imbalanced datasets can lead to DL models that are biased toward the more common classes.
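A simple sanity check for this criterion is the imbalance ratio of the label distribution; a sketch using only the standard library:

```python
from collections import Counter

def imbalance_ratio(labels):
    """Ratio of the most common to the least common class count.
    1.0 means perfectly balanced; large values flag imbalanced data."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())
```

Values far above 1.0 suggest the need for resampling, reweighting, or augmentation of the minority classes before training.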
-
Bias-free: bias in the data can lead to DL models that make biased decisions and do not generalize well to new situations. It is important to ensure that the data used to train a DL model is diverse and representative of the population the model will be used on, in order to avoid bias and improve model performance.
Discussion
-
Numerous TL approaches should be considered: train the DL model on an unlabeled image dataset, then transfer the learned knowledge to train the DL model with a reduced set of labeled images for the same task.
-
Powerful and effective models can be generated to improve NN performance more comprehensively once RL and other models are combined with TL.
-
The increasing interest in GANs stems from their ability to learn highly non-linear and deep mappings from latent space to data space and vice versa, as well as their ability to exploit unlabeled image data, bringing them close to deep representation learning. Many algorithms and theories can be formulated by adopting the GAN framework, which makes it suitable for new applications with deep networks.
-
As indicated in previous sections, different loss functions have been introduced to help in training on small datasets. We are convinced that loss functions are worth further investigation to overcome the weaknesses of previous approaches.
-
It is important to carefully curate and build a high-quality training dataset when developing DL models. A reliable and trustworthy training dataset can greatly improve a model's performance and help prevent overfitting.
-
As DL models become more complex in structure, it becomes more difficult for people to understand how they arrive at their decisions. Improving explainability is essential to build trust in these models and ensure that they make fair and unbiased decisions [625].
-
It is critical to ensure that DL models are robust/reliable and able to perform well with new data. It will require improving the quality and diversity of the data utilized to train them, as well as developing techniques to identify and address potential issues with the models.
-
Fairness in DL remains an open challenge and requires careful consideration of the data used to train the models, as well as both the potential biases present in that data and the development of techniques to overcome biases in the models [626].
-
Meta-learning and customized RL can be optimized for multiple applications [627]. Meta-learning has the potential to significantly enhance the capabilities of DL models, particularly in scenarios where training data is scarce, making it a promising area of research in DL.
-
Knowledge distillation is another technique to address the issue of data scarcity which is worth more investigation. It involves training a smaller model to mimic the behavior of a larger model [628].
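The central ingredient of knowledge distillation is training the student against the teacher's temperature-softened output distribution. A minimal sketch of that loss term, using only the standard library (the temperature value and two-class logits are illustrative):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature softens the
    distribution, exposing the teacher's 'dark knowledge' about
    relative class similarities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's softened
    outputs: the core term the smaller model minimizes to mimic the
    larger one."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))
```

In practice this term is combined with the ordinary cross-entropy on hard labels; the loss is minimized (equal to the teacher's entropy) exactly when the student reproduces the teacher's distribution.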
-
Information fusion involves combining information from multiple sources or modalities to make more accurate predictions or decisions in the context of DL. It can help overcome the limitations of individual data sources and improve model performance when training data is limited [629].
-
Federated learning is a DL technique that allows groups or organizations to collectively train and improve a shared global DL model [138]. However, the introduction of data fusion technology has brought new challenges for federated learning, such as the fusion of heterogeneous and multi-source data. As the variety and volume of data increase, it is essential to improve the use of data and models in federated learning. By eliminating redundant data and merging multiple data sources, it is possible to gain new and valuable information. In the future, issues such as maintaining user privacy, creating universal models, and ensuring the stability of data fusion results need to be addressed to facilitate the effective use of data in federated learning across multiple domains.
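The aggregation step at the heart of this setting can be sketched as federated averaging (FedAvg), where client updates are combined weighted by local dataset size; the flat parameter lists below are a simplification for illustration:

```python
def federated_average(client_weights, client_sizes):
    """FedAvg aggregation: weighted average of client model parameters,
    weighted by each client's local dataset size, producing the next
    shared global model without exchanging raw data."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    global_w = [0.0] * n_params
    for weights, size in zip(client_weights, client_sizes):
        share = size / total  # clients with more local data count more
        for i, w in enumerate(weights):
            global_w[i] += w * share
    return global_w
```

Only parameters leave the clients, which is what allows organizations to train a shared model while keeping their data local; the privacy and heterogeneity issues mentioned above concern what these parameter updates may still leak and how to average across non-identical data distributions.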
-
Finally, more pre-trained models, similar to the ImageNet model, are expected to appear in other areas such as medical imaging [630]. This would be a great opportunity for the generalization of DL models.