Designing architectures of convolutional neural networks to solve practical problems
Introduction
The vast amount of data currently available has fostered the development of methodologies capable of processing and extracting meaningful features to assist the interpretation, understanding, and solution of complex problems. In this context, the area of Deep Learning (DL) has emerged as a main alternative to analyze massive data, presenting breakthrough results in tasks such as speech recognition (Graves, Mohamed, & Hinton, 2013), machine translation (Luong, Sutskever, Le, Vinyals, & Zaremba, 2014), and data classification (Lauer, Suen, Bloch, 2007, LeCun, Bottou, Bengio, Haffner, 1998, Sharif Razavian, Azizpour, Sullivan, Carlsson, 2014, Zhou, Lapedriza, Xiao, Torralba, Oliva, 2014).
Deep Learning algorithms operate at multiple levels, each of which is composed of a set of regression models involving linear and nonlinear components. The combination of multiple models makes it possible to represent complex functions (LeCun, Bengio, & Hinton, 2015). Most DL algorithms resemble Artificial Neural Networks, such as the Multilayer Perceptron (Haykin, 2004), where input vectors are processed through consecutive layers containing operation units that emphasize or inhibit features (LeCun, Bottou, Bengio, Haffner, 1998, Sharif Razavian, Azizpour, Sullivan, Carlsson, 2014).
A particularly important DL algorithm is the so-called Convolutional Neural Network (CNN), which has gained prestige mainly due to its good performance in computer vision tasks (Oquab, Bottou, Laptev, Sivic, 2015, Scherer, Müller, Behnke, 2010a, Scherer, Schulz, Behnke, 2010b, Sharif Razavian, Azizpour, Sullivan, Carlsson, 2014, Zhou, Lapedriza, Xiao, Torralba, Oliva, 2014), and its ability to extract features from time-dependent data such as audio and video (Karpathy, Toderici, Shetty, Leung, Sukthankar, Fei-Fei, 2014, Osadchy, Cun, Miller, 2007, Schluter, Bock, 2014). However, the performance of a CNN strongly depends on its architecture, including the number of layers, the number of units per layer, and the sizes of the convolutional masks.
Moreover, an architecture that works well for a given problem may not be appropriate when dealing with different data types or tasks. A usual alternative to remedy this issue is to evaluate different CNN architectures and choose the one with the best performance. Such a procedure clearly bears a number of drawbacks: (i) the training and evaluation of a single architecture is already computationally intensive, so the overall assessment of several of them may be unfeasible in many scenarios; and (ii) the empirically defined architectures may not be appropriate for the problem under consideration (Lappas, Chen, 2009, Menotti, Chiachia, Falcao, Oliveira Neto, 2014), so acceptable results may never be reached.
A strategy commonly employed to avoid the assessment of multiple CNN settings considers an additional training stage to tune the weights associated with convolutional units. This also increases the computational burden and requires thousands or millions of examples to produce reasonable results (Lauer, Suen, Bloch, 2007, LeCun, Bottou, Bengio, Haffner, 1998, Simard, Steinkraus, Platt, 2003). Another strategy is the ensemble of CNNs, which usually allows a reduction in the maximum number of iterations at the cost of training more architectures to obtain relevant results. It is also time consuming, though, threatening its application in production environments (Ciregan, Meier, & Schmidhuber, 2012).
In this work, we present a novel methodology to assist the definition of CNN architectures that differs substantially from the described alternatives. Specifically, our approach analyzes the input and output images produced by convolutional operations at each CNN layer in order to estimate adequate dimensions for the convolutional masks, and to suitably set the number of convolutional units per layer. In addition, motivated by Occam's razor problem-solving principle (Blumer, Ehrenfeucht, Haussler, & Warmuth, 1987), we also aim to design architectures as simple as possible, yet efficient, to address target problems.
The faster convergence of simpler CNN architectures is here demonstrated using the Statistical Learning Theory, more specifically as a result of the Chernoff bound and the Hoeffding inequality (Devroye, Györfi, Lugosi, 2013, Vapnik, 2013, Von Luxburg, Schölkopf, 2011). Eq. (1) presents the condition that ensures learning bounds,

P(sup_{f ∈ F} |R(f) − R_emp(f)| > ϵ) ≤ 2 N(F, 2n) e^(−nϵ²/4),    (1)

in which N(F, 2n) corresponds to the Shattering coefficient, F is the set of all possible functions provided by an algorithm (a.k.a. the algorithm bias), and n is the sample size (here two samples with n elements are provided). The Shattering coefficient is indeed a function of n which defines the maximum number of admissible functions contained in F that produce distinct classifications considering the worst possible sample organization with size n (Smola & Schölkopf, 1998).
As already discussed in Von Luxburg and Schölkopf (2011), the Shattering coefficient can be approximated as N(F, 2n) ≈ a n² when considering a single neuron of some deep network (see Eq. (2)), for some constant a > 0. Thus, Eq. (3) provides the Shattering coefficient for a given deep architecture composed of k units in total:

N(F, 2n) ≈ (a n²)^k.    (3)
The sample size required to ensure convergence is defined in Eq. (4),

P(sup_{f ∈ F} |R(f) − R_emp(f)| > ϵ) ≤ 2 (a n²)^k e^(−nϵ²/4),    (4)

in which R(f) represents the real risk (the expected value of the loss function, in range [0, 1]), R_emp(f) corresponds to the empirical risk (the average loss computed on a given sample, in range [0, 1]), 0 < ϵ < 1 is a threshold that indicates an acceptable divergence limit between risks, F is the set of all functions provided by some supervised learning algorithm, and n is the sample size. Thus, the convergence of some algorithm for a given network architecture is here analyzed in terms of its number of units and sample size. Eq. (5) defines the generalization probability

δ = 2 (a n²)^k e^(−nϵ²/4),    (5)

in which δ represents the probability that an algorithm does not generalize. So, the main goal of the Statistical Learning Theory, as proved by Vapnik (2013), is to make sure the term δ converges to zero as a greater sample size is provided, guaranteeing generalization.
From Eq. (5), it is possible to conclude that additional CNN units directly require greater sample sizes in order to make the right-hand term of the inequality approach zero. Fig. 1 illustrates the generalization probability produced according to Eq. (5), in which the sample size n varies from 1 to 1 million, the number of network units k is set as {10, 50, 100, 250}, and ϵ = 0.05 (5% of divergence is accepted between the expected and the empirical risks). Convergence initiates when a curve starts decaying, but it only occurs when a large enough sample size is provided, so that the bound tends to zero.
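The behavior discussed for Eq. (5), namely δ = 2 (a n²)^k e^(−nϵ²/4), can be checked numerically. The sketch below is illustrative: the constant a is not specified in the text, so a = 1 is assumed, and the bound is evaluated in log-space to avoid overflow for large k and n.

```python
import math

def log_generalization_bound(n, k, eps=0.05, a=1.0):
    """Log of the bound in Eq. (5): delta = 2 * (a * n^2)^k * exp(-n * eps^2 / 4).
    Computed in log-space since n^(2k) overflows for realistic k and n."""
    return math.log(2) + k * (math.log(a) + 2 * math.log(n)) - n * eps**2 / 4

# k = 10 units, eps = 0.05: the bound only drops below 1 (log < 0)
# once the exponential term dominates the polynomial Shattering term.
for n in (10_000, 100_000, 500_000):
    print(n, log_generalization_bound(n, k=10))
```

With k = 10 the log-bound is still positive at n = 10,000 but clearly negative at n = 500,000, while with k = 100 it remains positive even at n = 10^6 (under the assumed a = 1), illustrating why larger networks demand far larger samples.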
The Shattering coefficient is the main term holding back convergence and making it require more data examples, as shown in Fig. 1. The greater the complexity of the Shattering coefficient, the longer the bound behaves as an exponential function; only after a large enough sample size is it dominated by the other Chernoff terms and, finally, decreases and converges. In addition, the theoretical convergence could not be illustrated for any network with 250 or more neurons: (i) first of all, the term a would need to be smaller to make the Chernoff bound approach zero; (ii) secondly, the bound tended to infinity as more data examples were provided; and finally, (iii) not even with 1 million examples could it converge in theory. We believe this theoretical demonstration is necessary and sufficient to confirm the need for designing simpler (fewer neurons and layers) architectures.
Based on this theoretical formulation after Vapnik (2013), it is clear that larger CNN networks require larger samples to guarantee learning convergence and, ultimately, to prove that a good classifier was obtained. Thus, the goal of this paper is to estimate adequate parameters in order to design simpler yet efficient CNN architectures, producing networks that converge faster with a reasonable number of training examples. In summary, we propose a method to estimate the adequate dimensions of the convolutional masks (convolutional kernels) and the number of convolutional units (CNN neurons) at each layer of a CNN architecture for general-purpose classification tasks.
Motivated by the False Nearest Neighbors (FNN) method (Kennel, Brown, & Abarbanel, 1992), a well-known tool from the area of Dynamical Systems, our method analyzes the input and output images produced by the convolutional operation of each CNN layer in order to estimate the adequate dimensions for the convolutional masks and the suitable number of convolutional units per layer. In more detail, this analysis takes each image and builds up vectors to embed data into high-dimensional spaces in an attempt to increase the recurrence levels. Recurrences are here associated with the prevalent and most similar patterns occurring in a given input image.
The CNN architectures produced by our approach were compared against more complex ones, using the Caffe deep learning framework (Jia et al., 2014). Four datasets were used: i) CMU Face Images (Roweis & Saul, 2000), which contains images of human faces; ii) MNIST, a dataset of handwritten digits; iii) Columbia University Image Library, referred to as COIL-100 (Nene, Nayar, & Murase, 1996), which contains object images divided into 100 classes; and iv) German Traffic Sign Recognition Benchmark, referred to as GTSRB (Stallkamp, Schlipsing, Salmen, & Igel, 2012), a dataset of traffic sign images.
Experimental results confirm that our CNN architectures have error rates similar to those listed in the literature, while presenting significantly lower complexity (fewer layers and fewer units per layer). Therefore, our methodology turns out to be a viable alternative to design simpler CNN architectures that are faster to train and less prone to overfitting (Cogswell, Ahmed, Girshick, Zitnick, & Batra, 2015, Lappas, Chen, 2009).
This paper is organized as follows: Section 2 discusses Deep Learning approaches and their results, with particular attention to the datasets used in our experiments; Section 3 uses Linear Algebra to formulate CNN operations; Section 4 addresses the original False Nearest Neighbors (FNN) method, proposed by Kennel et al. (1992); Section 5 describes all FNN modifications necessary to make the estimation of CNN architectures possible; Section 6 shows experimental results on the four selected datasets, evaluating the mask sizes and the number of convolutional units for a CNN with one, two, and three convolutional layers; Section 7 presents concluding remarks and perspectives for future work.
Related work
The good performance of DL algorithms in classification problems has motivated their use in several domains, in particular to tackle handwritten digit recognition, where the MNIST dataset is considered one of the main benchmarks. Among the best results reported for it, LeCun, Haffner, Bottou, and Bengio (1999) achieved an error rate of 0.95% using a CNN with the LeNet-5 architecture, reducing it to 0.8% after
Convolutional neural network
The Convolutional Neural Network (CNN) is a Deep Learning algorithm designed to process multidimensional data, such as signals, images, and videos (LeCun et al., 2015), and to extract relevant features even in the presence of noise, shifting, rescaling, and other types of data distortions (Goodfellow, Courville, Bengio, 2016, LeCun, Bengio, 1995, LeCun, Bottou, Bengio, Haffner, 1998). A CNN is a multilayer network, and each layer is composed of units responsible for different types of operations,
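As a concrete illustration of the convolutional operation this section formulates, the sketch below applies a single mask to an image in "valid" mode using plain NumPy. The averaging mask and the image values are arbitrary, chosen only for demonstration; strictly speaking the loop computes a cross-correlation, which is the convention most CNN implementations use.

```python
import numpy as np

def conv2d_valid(image, mask):
    """Naive 'valid' 2-D convolution (cross-correlation): slide the mask
    over the image and compute the inner product at every position."""
    M, N = mask.shape
    H, W = image.shape
    out = np.empty((H - M + 1, W - N + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+M, j:j+N] * mask)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
mask = np.ones((2, 2)) / 4.0            # simple averaging mask
print(conv2d_valid(image, mask).shape)  # (3, 3)
```

Note how the output shrinks from 4 × 4 to 3 × 3: the mask dimensions M × N directly determine the output size, which is one reason their choice matters for the architecture.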
False nearest neighbors
Takens (1981) proposed a methodology to unfold time-dependent data into multidimensional spaces, also referred to as phase spaces, which makes the identification of recurrences easier, thus simplifying tasks such as modeling and forecasting. The method embeds a time series x(t) in the phase space by producing states in the form φ(t) = (x(t), x(t+d), …, x(t+(m−1)d)), where parameter m defines the dimension of the phase space (also called embedding dimension) and d is the time delay. Although very
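The embedding above can be sketched in a few lines. This is a minimal illustration (the function name and the sample sine series are our own), producing one m-dimensional state per admissible time index:

```python
import numpy as np

def takens_embedding(x, m, d):
    """Embed a 1-D series into an m-dimensional phase space with delay d:
    state(t) = (x[t], x[t+d], ..., x[t+(m-1)*d])."""
    x = np.asarray(x)
    n_states = len(x) - (m - 1) * d
    return np.array([x[t : t + m * d : d] for t in range(n_states)])

series = np.sin(np.linspace(0, 8 * np.pi, 200))
states = takens_embedding(series, m=2, d=10)
print(states.shape)  # (190, 2)
```

For a periodic series such as this sine wave, nearby states in the phase space correspond to recurrences, which is exactly what the FNN criterion inspects when choosing m.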
Adapting the false nearest neighbors method
We noticed that the vector representation used by a CNN on a local region of some image I (Section 3) is obtained after an application of Takens' immersion theorem considering two embedding dimensions: the embedding dimension along rows (M) and the embedding dimension over columns (N), which together define the dimensionality of the vectors (see Section 3). Thus, we adapted the False Nearest Neighbors (FNN) method in order to properly estimate
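A minimal sketch of this two-dimensional embedding (the function name is illustrative; unit stride and unit delay are assumed) extracts every M × N local region of an image and flattens it into a vector of dimension M · N, the representation over which nearest-neighbor comparisons are then carried out:

```python
import numpy as np

def embed_image(image, M, N):
    """Embed an image into vectors of dimension M*N: one flattened
    M x N patch per valid position (row embedding M, column embedding N)."""
    H, W = image.shape
    patches = [image[i:i+M, j:j+N].ravel()
               for i in range(H - M + 1)
               for j in range(W - N + 1)]
    return np.array(patches)

image = np.arange(25, dtype=float).reshape(5, 5)
vectors = embed_image(image, M=3, N=3)
print(vectors.shape)  # (9, 9)
```

Each row of the result is one phase-space state of the image; increasing M or N raises the embedding dimension, which is the knob the adapted FNN method tunes to pick mask sizes.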
Experiments
This section presents the datasets considered, then we show the setup for our approach based on the False Nearest Neighbors method, and for the Convolutional Neural Network (CNN). Next, we present the results that our FNN approach produced when evaluating the training examples contained in each dataset, while: i) assessing the mask sizes for convolutional units; and ii) varying the number of convolutional units. Then, the best mask sizes and number of units found were used to perform
Conclusions
Studies on Deep Learning have usually considered very complex CNN architectures, containing many layers and convolutional units, in an attempt to improve classification tasks. However, most of those studies lack justification for the parametrization considered. In fact, most of them simply analyze several CNN settings to empirically find an adequate architecture. What they probably miss is that those settings may not be enough to provide simpler and adequate architectures to tackle practical problems,
Acknowledgements
We would like to thank Prof. Moacir Ponti for reviewing this work as well as for his suggestions. This paper is supported by CAPES, Brazil, under grant no. 7901561/D, by CNPq, Brazil, under grants 03051/2014-0 and 302643/2013-3, and by FAPESP, Brazil, under grant no. 2011/22749-8, 2012/17961-0 and 2014/13323-5. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the CAPES, CNPq or FAPESP.
References

- Occam's razor. Information Processing Letters (1987).
- Learning invariant feature hierarchies. Computer vision – ECCV 2012. Workshops and demonstrations (2012).
- Linear algebra and its applications (1988).
- A framework for designing the architectures of deep convolutional neural networks. Entropy (2017).
- Chaos in differential equations. Chaos (1997).
- Stochastic gradient learning in neural networks. Proceedings of Neuro-Nîmes (1991).
- Multi-column deep neural networks for image classification. Computer vision and pattern recognition (CVPR), 2012 IEEE conference on (2012).
- Cogswell, M., Ahmed, F., Girshick, R.B., Zitnick, L., & Batra, D. (2015). Reducing overfitting in deep networks by...
- Frustratingly easy domain adaptation. Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing (2010).
- Large scale distributed deep networks. Advances in neural information processing systems (2012).
- A probabilistic theory of pattern recognition.
- Designing artificial neural networks using particle swarm optimization algorithms. Computational Intelligence and Neuroscience.
- Maxout networks. ICML.
- Speech recognition with deep recurrent neural networks. 2013 IEEE international conference on acoustics, speech and signal processing.
- Supervised image classification using deep convolutional wavelets network. Tools with artificial intelligence (ICTAI), 2015 IEEE 27th international conference on.
- A comprehensive foundation. Neural Networks.
- A fast learning algorithm for deep belief nets. Neural Computation.
- Large-scale learning with SVM and convolutional for generic object categorization. Computer vision and pattern recognition, 2006 IEEE computer society conference on.
- R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics.
- What is the best multi-stage architecture for object recognition? Computer vision, 2009 IEEE 12th international conference on.
- Caffe: Convolutional architecture for fast feature embedding. Proceedings of the ACM international conference on multimedia.
- Nonlinear time series analysis.
- Large-scale video classification with convolutional neural networks. Proceedings of the IEEE conference on computer vision and pattern recognition.
- Determining embedding dimension for phase-space reconstruction using a geometrical construction. Physical Review A.
- Neural networks and multimedia datasets: Estimating the size of neural networks for achieving high classification accuracy. WSEAS international conference. Proceedings. Mathematics and computers in science and engineering.
- A trainable feature extractor for handwritten digit recognition. Pattern Recognition.
- Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks.
- Deep learning. Nature.
- Gradient-based learning applied to document recognition. Proceedings of the IEEE.