Introduction
-
This is the first review that provides a comprehensive survey of the most important aspects of deep learning; it helps researchers and students gain a thorough understanding of the field from a single paper.
-
We explain CNNs, the most popular deep learning algorithm, in depth by describing the concepts, theory, and state-of-the-art architectures.
-
We review the current challenges (limitations) of deep learning, including lack of training data, imbalanced data, interpretability of data, uncertainty scaling, catastrophic forgetting, model compression, overfitting, the vanishing gradient problem, the exploding gradient problem, and underspecification. We additionally discuss the proposed solutions tackling these issues.
-
We provide an exhaustive list of deep learning applications in medical imaging, categorized by task, starting with classification and ending with registration.
-
We discuss the computational approaches (CPU, GPU, FPGA) and compare the influence of each on deep learning algorithms.
Survey methodology
Journal | IF 2019 | CiteScore 2019 | Publisher |
---|---|---|---|
Pattern Recognition | 7.196 | 13.1 | Elsevier |
Pattern Recognition Letters | 3.255 | 6.3 | Elsevier |
Artificial Intelligence Review | 5.747 | 9.1 | Springer |
Expert Systems with Applications | 5.452 | 11 | Elsevier |
Neurocomputing | 4.438 | 9.5 | Elsevier |
Nature Medicine | 36.130 | 45.9 | Nature |
Nature | 42.779 | 51 | Nature |
Journal of Big Data | – | 6.1 | Springer |
Multimedia Tools and Applications | 2.313 | 3.7 | Springer |
Computer Methods and Programs in Biomedicine | 3.632 | 7.5 | Elsevier |
Machine Learning | 2.672 | 5.0 | Springer |
Machine Vision and Applications | 1.605 | 4.2 | Springer |
Medical Image Analysis | 11.148 | 17.2 | Elsevier |
IEEE Access | 3.745 | 3.9 | IEEE |
IEEE Transactions on Knowledge and Data Engineering | 4.935 | 12.0 | IEEE |
Nature Communications | 12.121 | 18.1 | Nature |
IEEE Transactions on Intelligent Transportation Systems | 6.319 | 12.7 | IEEE |
Methods | 3.812 | 8.0 | Elsevier |
ACM Journal on Emerging Technologies in Computing Systems | 1.652 | 4.3 | ACM |
ACM Computing Surveys | 6.319 | 12.7 | ACM |
Applied Soft Computing | 5.472 | 10.2 | Elsevier |
Electronics | 2.412 | 1.9 | MDPI |
Applied Sciences | 2.474 | 2.4 | MDPI |
IEEE Transactions on Industrial Informatics | 9.112 | 13.9 | IEEE |
Background
When to apply deep learning
-
Cases where human experts are not available.
-
Cases where humans are unable to explain decisions made using their expertise (language understanding, medical decisions, and speech recognition).
-
Cases where the problem solution updates over time (price prediction, stock preference, weather prediction, and tracking).
-
Cases where solutions require adaptation based on specific cases (personalization, biometrics).
-
Cases where the size of the problem is extremely large and exceeds our inadequate reasoning abilities (sentiment analysis, matching ads on Facebook, calculating webpage ranks).
Why deep learning?
Classification of DL approaches
Deep supervised learning
Deep semi-supervised learning
Deep unsupervised learning
Deep reinforcement learning
-
It helps identify which action yields the highest reward over the long term.
-
It helps discover which situations require action.
-
It enables the learning agent to figure out the best approach for obtaining large rewards.
-
Reinforcement Learning also gives the learning agent a reward function.
-
Reinforcement learning is not preferable when there is sufficient data to resolve the issue with supervised learning techniques.
-
Reinforcement learning is computing-heavy and time-consuming, especially when the state space is large.
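As a concrete illustration of these ideas (a reward function, actions chosen for long-term reward, and exploration of situations that require action), the following is a minimal tabular Q-learning sketch in Python; the environment sizes, hyper-parameter values, and function names are illustrative assumptions, not part of any method reviewed here.

```python
import numpy as np

# Minimal tabular Q-learning sketch: the agent learns which action yields the
# highest long-term reward by repeatedly updating a state-action value table.
n_states, n_actions = 5, 2               # toy sizes (illustrative)
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1    # learning rate, discount, exploration rate

def update(state, action, reward, next_state):
    # Bellman update: move Q(s, a) toward reward + discounted best future value
    best_next = np.max(Q[next_state])
    Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])

def choose_action(state):
    # epsilon-greedy: mostly exploit the best-known action, occasionally explore
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))
```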
Types of DL networks
Recursive neural networks
Recurrent neural networks
Convolutional neural networks
Benefits of employing CNNs
CNN layers
-
Kernel definition: the kernel is described by a grid of discrete numbers or values, each of which is called a kernel weight. At the beginning of the CNN training process, random numbers are assigned as the kernel weights; several different initialization methods also exist. These weights are then adjusted at each training epoch, so the kernel learns to extract significant features.
-
Convolutional operation: first, the CNN input format must be described. While a traditional neural network takes its input in vector format, a CNN takes a multi-channeled image: a gray-scale image has a single channel, whereas an RGB image has three. To understand the convolutional operation, consider a \(4 \times 4\) gray-scale image with a \(2 \times 2\) randomly weight-initialized kernel. The kernel slides over the whole image horizontally and vertically, and at each position the dot product between the kernel and the overlapping image region is computed: the corresponding values are multiplied and then summed up to create a single scalar value. The process is repeated until no further sliding is possible, and the calculated dot-product values form the output feature map. Figure 8 graphically illustrates the primary calculations executed at each step: the light green color represents the \(2 \times 2\) kernel, while the light blue color represents the same-sized area of the input image; the two are multiplied element-wise, and the sum of the resulting products (marked in light orange) becomes one entry of the output feature map. Note that in this example no padding is applied to the input image, while a stride of one (the selected step size over all vertical and horizontal locations) is applied to the kernel. Other stride values are possible; increasing the stride value produces a feature map of lower dimensions. Padding, on the other hand, is highly significant for preserving border information of the input image; without it, the border-side features are washed away too quickly. Applying padding increases the size of the input image and, in turn, the size of the output feature map. A minimal code sketch of this operation is given after this list. Core benefits of convolutional layers:
-
Sparse connectivity: in fully connected (FC) neural networks, each neuron in one layer links with all neurons in the following layer. By contrast, in CNNs only a few weights exist between two adjacent layers. Thus, the number of required weights or connections is small, and the memory required to store them is also small; hence, this approach is memory-efficient. Moreover, a full matrix operation is computationally much more costly than the dot (.) product operations used in a CNN.
-
Weight sharing: in a CNN, no dedicated weights are allocated between any two neurons of neighboring layers; instead, the same set of kernel weights operates on every pixel of the input matrix. Learning a single group of weights for the whole input significantly decreases the required training time and various costs, as it is not necessary to learn additional weights for each neuron.
-
Sigmoid: The input of this activation function is real numbers, while the output is restricted to between zero and one. The sigmoid function curve is S-shaped and can be represented mathematically by Eq. 2.$$ f(x)_{sigm}=\frac{1}{1+e^{-x}} $$(2)
-
Tanh: It is similar to the sigmoid function, as its input is real numbers, but the output is restricted to between − 1 and 1. Its mathematical representation is in Eq. 3.$$ f(x)_{tanh}=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}} $$(3)
-
ReLU: the most commonly used function in the CNN context. It maps all negative input values to zero while keeping positive values unchanged. Its main benefit over the others is a lower computational load. Its mathematical representation is in Eq. 4. Occasionally, a few significant issues may occur during the use of ReLU. For instance, consider an error back-propagation algorithm with a large gradient flowing through it; passing this gradient through the ReLU function may update the weights in a way that the neuron is never activated again. This issue is referred to as the “Dying ReLU” problem. Some ReLU alternatives exist to solve it; a few of them are discussed next, and a code sketch of these activation functions is given after this list.$$ f(x)_{ReLU}= max(0,x) $$(4)
-
Leaky ReLU: instead of setting negative inputs to zero as ReLU does, this activation function down-scales them so that they are never completely ignored. It is employed to solve the Dying ReLU problem. Leaky ReLU can be represented mathematically as in Eq. 5, where the leak factor is denoted by m and is commonly set to a very small value, such as 0.001.$$\begin{aligned} f(x)_{Leaky ReLU}= \left \{ \begin{array}{ll} x,& if \quad x > 0\\ mx,& x \le 0 \end{array} \right . \end{aligned}$$(5)
-
Noisy ReLU: this function employs a Gaussian distribution to make ReLU noisy. It can be represented mathematically as in Eq. 6.$$ f(x)_{Noisy ReLU}= max(0, x+Y), \quad with\; Y \sim N (0,\sigma (x)) $$(6)
-
Parametric Linear Units: this is essentially the same as Leaky ReLU, except that the leak factor is updated during the model training process. The parametric linear unit can be represented mathematically as in Eq. 7, where the learnable weight is denoted as a.$$\begin{aligned} f(x)_{ Parametric Linear}= \left \{ \begin{array}{ll} x,& if\; x >0\\ ax,& x \le 0 \end{array} \right . \end{aligned}$$(7)
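As promised above, the convolutional operation can be expressed in a few lines of code. The following NumPy sketch slides a \(2 \times 2\) kernel over a \(4 \times 4\) gray-scale image with a stride of one and no padding, exactly as in the example; the image values and the function name `conv2d` are illustrative choices.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image and compute dot products (no padding)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    feature_map = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            feature_map[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return feature_map

# 4x4 gray-scale image and a 2x2 randomly initialized kernel, as in the example above
image  = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.random.randn(2, 2)
print(conv2d(image, kernel))   # 3x3 output feature map (stride 1, no padding)
```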
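For reference, the activation functions of Eqs. 2–7 can be written directly from their definitions. The NumPy sketch below is illustrative only; the leak factor and noise scale are example values, not recommendations.

```python
import numpy as np

def sigmoid(x):                 # Eq. 2: output restricted to (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                    # Eq. 3: output restricted to (-1, 1)
    return np.tanh(x)

def relu(x):                    # Eq. 4: negative inputs mapped to zero
    return np.maximum(0.0, x)

def leaky_relu(x, m=0.001):     # Eq. 5: small leak factor m for negative inputs
    return np.where(x > 0, x, m * x)

def noisy_relu(x, sigma=0.1):   # Eq. 6: Gaussian noise added before the ReLU
    return np.maximum(0.0, x + np.random.normal(0.0, sigma, size=x.shape))

def prelu(x, a):                # Eq. 7: leak factor a is learned during training
    return np.where(x > 0, x, a * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), leaky_relu(x))
```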
Regularization to CNN
-
It prevents the problem of vanishing gradient from arising.
-
It can effectively compensate for poor weight initialization.
-
It significantly reduces the time required for network convergence (for large-scale datasets, this will be extremely useful).
-
It helps reduce the dependence of training on the choice of hyper-parameters.
-
It reduces the chances of over-fitting, since it has a slight regularization effect.
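In the reviewed literature, the benefits listed above are typically attributed to batch normalization. The following is a minimal PyTorch sketch of a convolutional block that combines batch normalization with dropout (another common CNN regularizer); the channel counts and dropout rate are illustrative assumptions rather than recommended settings.

```python
import torch.nn as nn

# Minimal sketch of a convolutional block with batch normalization and dropout;
# layer sizes are illustrative only.
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),   # normalizes activations over each mini-batch
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Dropout(p=0.25),   # randomly drops activations to reduce over-fitting
)
```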
Optimizer selection
Design of algorithms (backpropagation)
Improving performance of CNN
-
Expand the dataset with data augmentation or use transfer learning (both are explained in later sections); a minimal augmentation sketch follows this list.
-
Increase the training time.
-
Increase the depth (or width) of the model.
-
Add regularization.
-
Tune the hyper-parameters more extensively.
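To illustrate the first tip above, the sketch below builds a simple image-augmentation pipeline with torchvision; the particular transforms and parameter values are illustrative choices, not recommendations from the reviewed works.

```python
from torchvision import transforms

# Minimal data-augmentation sketch; the chosen transforms and their parameters
# are illustrative, not prescriptive.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])
# Applying `augment` to each training image yields a slightly different sample
# every epoch, effectively expanding the dataset without collecting new labels.
```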
CNN architectures
Model | Main finding | Depth | Dataset | Error rate | Input size | Year |
---|---|---|---|---|---|---|
AlexNet | Utilizes Dropout and ReLU | 8 | ImageNet | 16.4 | \(227 \times 227 \times 3\) | 2012 |
NIN | New layer, called ‘mlpconv’, utilizes GAP | 3 | CIFAR-10, CIFAR-100, MNIST | 10.41, 35.68, 0.45 | \(32 \times 32 \times 3\) | 2013 |
ZfNet | Visualization idea of middle layers | 8 | ImageNet | 11.7 | \(224 \times 224 \times 3\) | 2014 |
VGG | Increased depth, small filter size | 16, 19 | ImageNet | 7.3 | \(224 \times 224 \times 3\) | 2014 |
GoogLeNet | Increased depth, block concept, different filter size, concatenation concept | 22 | ImageNet | 6.7 | \(224 \times 224 \times 3\) | 2015 |
Inception-V3 | Utilizes small filter size, better feature representation | 48 | ImageNet | 3.5 | \(229 \times 229 \times 3\) | 2015 |
Highway | Presented the multipath concept | 19, 32 | CIFAR-10 | 7.76 | \(32 \times 32 \times 3\) | 2015 |
Inception-V4 | Divided transform and integration concepts | 70 | ImageNet | 3.08 | \(229 \times 229 \times 3\) | 2016 |
ResNet | Robust against overfitting due to symmetry mapping-based skip links | 152 | ImageNet | 3.57 | \(224 \times 224 \times 3\) | 2016 |
Inception-ResNet-v2 | Introduced the concept of residual links | 164 | ImageNet | 3.52 | \(229 \times 229 \times 3\) | 2016 |
FractalNet | Introduced the concept of Drop-Path as regularization | 40, 80 | CIFAR-10, CIFAR-100 | 4.60, 18.85 | \(32 \times 32 \times 3\) | 2016 |
WideResNet | Decreased the depth and increased the width | 28 | CIFAR-10, CIFAR-100 | 3.89, 18.85 | \(32 \times 32 \times 3\) | 2016 |
Xception | A depthwise convolution followed by a pointwise convolution | 71 | ImageNet | 0.055 | \(229 \times 229 \times 3\) | 2017 |
Residual attention neural network | Presented the attention technique | 452 | CIFAR-10, CIFAR-100 | 3.90, 20.4 | \(40 \times 40 \times 3\) | 2017 |
Squeeze-and-excitation networks | Modeled interdependencies between channels | 152 | ImageNet | 2.25 | \(229 \times 229 \times 3\), \(224 \times 224 \times 3\), \(320 \times 320 \times 3\) | 2017 |
DenseNet | Blocks of layers; layers connected to each other | 201 | CIFAR-10, CIFAR-100, ImageNet | 3.46, 17.18, 5.54 | \(224 \times 224 \times 3\) | 2017 |
Competitive squeeze and excitation network | Both residual and identity mappings utilized to rescale the channel | 152 | CIFAR-10, CIFAR-100 | 3.58, 18.47 | \(32 \times 32 \times 3\) | 2018 |
MobileNet-v2 | Inverted residual structure | 53 | ImageNet | – | \(224 \times 224 \times 3\) | 2018 |
CapsuleNet | Pays attention to spatial relationships between features | 3 | MNIST | 0.00855 | \(28 \times 28 \times 1\) | 2018 |
HRNetV2 | High-resolution representations | – | ImageNet | 5.4 | \(224 \times 224 \times 3\) | 2020 |
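Several entries in the table above (ResNet, Inception-ResNet-v2, MobileNet-v2) rely on residual or skip connections. The following is a minimal PyTorch sketch of a residual block with an identity shortcut; the channel count and layer arrangement are illustrative and do not reproduce any specific architecture from the table.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: the input skips over two conv layers and is
    added back to their output (the 'skip link' noted for ResNet above)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1   = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2   = nn.BatchNorm2d(channels)
        self.relu  = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # identity shortcut

y = ResidualBlock(16)(torch.randn(1, 16, 32, 32))  # output shape preserved: (1, 16, 32, 32)
```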
AlexNet
Network-in-network
ZefNet
Visual geometry group (VGG)
GoogLeNet
Highway network
ResNet
Inception: ResNet and Inception-V3/4
DenseNet
ResNext
WideResNet
Pyramidal Net
Xception
Residual attention neural network
Convolutional block attention module
Concurrent spatial and channel excitation mechanism
CapsuleNet
High-resolution network (HRNet)
Challenges (limitations) of deep learning and alternate solutions
Training data
Transfer learning
Data augmentation techniques
Imbalanced data
Interpretability of data
Uncertainty scaling
Catastrophic forgetting
Model compression
Overfitting
Vanishing gradient problem
Exploding gradient problem
Underspecification
Applications of deep learning
Classification
Localization
Detection
Segmentation
Registration
-
Target Selection: determines the reference (target) input image onto which the second input image must be accurately superimposed.
-
Feature Extraction: computes a set of features from each input image.
-
Feature Matching: finds correspondences between the features obtained from the two images.
-
Pose Optimization: minimizes the distance (misalignment) between the two input images.
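As an illustration of this four-step pipeline, the sketch below uses classical OpenCV ORB features; the file names are placeholders, and a deep-learning-based registration method would replace the hand-crafted features and matcher with learned counterparts.

```python
import cv2
import numpy as np

# Illustrative classical registration pipeline following the four steps above
# (target selection, feature extraction, feature matching, pose optimization).
target = cv2.imread("target.png", cv2.IMREAD_GRAYSCALE)    # step 1: target selection
moving = cv2.imread("moving.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create()                                      # step 2: feature extraction
kp1, des1 = orb.detectAndCompute(target, None)
kp2, des2 = orb.detectAndCompute(moving, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)  # step 3: feature matching
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

src = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
H, _ = cv2.findHomography(src, dst, cv2.RANSAC)             # step 4: pose optimization
aligned = cv2.warpPerspective(moving, H, target.shape[::-1])
```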
Computational approaches
Feature | Assessment | Leader |
---|---|---|
Development | CPU is the easiest to program, then GPU, then FPGA | CPU |
Size | Both FPGA and CPU have smaller volume solutions due to their lower power consumption | FPGA-CPU |
Customization | Broader flexibility is provided by FPGA | FPGA |
Ease of change | Easier way to vary application functionality is provided by GPU and CPU | GPU-CPU |
Backward compatibility | Transferring RTL to novel FPGA requires additional work. Furthermore, GPU has less stable architecture than CPU | CPU |
Interfaces | Several varieties of interfaces can be implemented using FPGA | FPGA |
Processing/$ | FPGA configurability allows it to cover a wider acceleration space, but GPU wins due to its considerable processing capability | FPGA-GPU |
Processing/watt | Customized designs can be optimized | FPGA |
Timing latency | Implemented FPGA algorithm offers deterministic timing, which is in turn much faster than GPU | FPGA |
Large data analysis | FPGA performs well for inline processing, while CPU supports storage capabilities and the largest memory | FPGA-CPU |
DCNN inference | FPGA has lower latency and can be customized | FPGA |
DCNN training | Greater float-point capabilities provided by GPU | GPU |
CPU-based approach
GPU-based approach
FPGA-based approach
Evaluation metrics
Frameworks and datasets
Framework | License | Core language | Year of release |
---|---|---|---|
TensorFlow | Apache 2.0 | C++ & Python | 2015 |
Keras | MIT | Python | 2015 |
Caffe | BSD | C++ | 2015 |
MatConvNet | Oxford | MATLAB | 2014 |
MXNet | Apache 2.0 | C++ | 2015 |
CNTK | MIT | C++ | 2016 |
Theano | BSD | Python | 2008 |
Torch | BSD | C & Lua | 2002 |
DL4j | Apache 2.0 | Java | 2014 |
Gluon | AWS Microsoft | C++ | 2017 |
OpenDeep | MIT | Python | 2017 |
Dataset | Num. of classes | Applications |
---|---|---|
ImageNet | 1000 | Image classification, object localization, object detection, etc. |
CIFAR10/100 | 10/100 | Image classification |
MNIST | 10 | Classification of handwritten digits |
Pascal VOC | 20 | Image classification, segmentation, object detection |
Microsoft COCO | 80 | Object detection, semantic segmentation |
YFCC100M | 8M | Video and image understanding |
YouTube-8M | 4716 | Video classification |
UCF-101 | 101 | Human action detection |
Kinetics | 400 | Human action detection |
Google Open Images | 350 | Image classification, segmentation, object detection |
CalTech101 | 101 | Classification |
Labeled Faces in the Wild | – | Face recognition |
MIT-67 scene dataset | 67 | Indoor scene recognition |
Summary and conclusion
-
DL still has difficulty modeling multiple complex data modalities at the same time. Multimodal DL has therefore become a common approach in recent DL developments.
-
DL requires sizeable datasets (preferably labeled) to train models and predict unseen data. This challenge becomes particularly difficult when real-time data processing is required or when the provided datasets are limited (as in the case of healthcare data). To alleviate this issue, TL and data augmentation have been researched over the last few years.
-
Although ML is slowly transitioning to semi-supervised and unsupervised learning to manage practical data without the need for manual human labeling, many of the current deep learning models still rely on supervised learning.
-
The CNN performance is greatly influenced by hyper-parameter selection. Any small change in the hyper-parameter values will affect the general CNN performance. Therefore, careful parameter selection is an extremely significant issue that should be considered during optimization scheme development.
-
Impressive and robust hardware resources like GPUs are required for effective CNN training. Moreover, they are also required for exploring the efficiency of using CNN in smart and embedded systems.
-
In the CNN context, ensemble learning [342, 343] represents a prospective research area. Combining multiple and diverse architectures can improve a model's generalizability across different image categories by extracting several levels of semantic image representation. Similarly, ideas such as new activation functions, dropout, and batch normalization also merit further investigation.
-
CNN learning capacity is significantly improved by exploiting depth and other structural adaptations. Substituting the traditional layer configuration with blocks has resulted in significant advances in CNN performance, as shown in the recent literature. Currently, developing novel and efficient block architectures is the main trend in new research on CNN architectures. HRNet is just one example showing that there are always ways to improve the architecture further.
-
It is expected that cloud-based platforms will play an essential role in the future development of computational DL applications. Utilizing cloud computing offers a solution to handling the enormous amount of data. It also helps to increase efficiency and reduce costs. Furthermore, it offers the flexibility to train DL architectures.
-
With recent developments in computational tools, including dedicated neural-network chips and mobile GPUs, we will see more DL applications on mobile devices, making DL easier for users to apply.
-
Regarding the issue of lack of training data, it is expected that various transfer learning techniques will be applied, such as training a DL model on a large unlabeled image dataset and then transferring that knowledge to train the DL model on a small number of labeled images for the same task.
-
Finally, this overview provides a starting point for the community interested in the field of DL. Furthermore, it helps researchers decide the most suitable direction of work to pursue in order to provide more accurate alternatives to the field.