Expert Systems with Applications

Volume 94, 15 March 2018, Pages 205-217

Designing architectures of convolutional neural networks to solve practical problems

https://doi.org/10.1016/j.eswa.2017.10.052

Highlights

  • Our approach aims to support the estimation of Convolutional Neural Network (CNN) parameters.

  • It intends to produce simpler CNN, reducing the complexity.

  • This estimation was based on False Nearest Neighbors method.

  • Caffe deep learning framework was used to conduct the training of CNN.

  • Our results are comparable even to very complex and empirical CNN architectures.

Abstract

The Convolutional Neural Network (CNN) figures among the state-of-the-art Deep Learning (DL) algorithms due to its robustness to data shifts and scale variations, and its capability of extracting relevant information from large-scale input data. However, setting appropriate parameters to define CNN architectures is still a challenging issue, mainly when tackling real-world problems. A typical approach consists of empirically assessing different CNN settings in order to select the most appropriate one. This procedure has clear limitations, including the choice of suitable predefined configurations as well as the high computational cost involved in evaluating each of them. This work presents a novel methodology to tackle the aforementioned issues, providing mechanisms to estimate effective CNN configurations, including the size of convolutional masks (convolutional kernels) and the number of convolutional units (CNN neurons) per layer. Based on the False Nearest Neighbors (FNN) method, a well-known tool from the area of Dynamical Systems, the proposed approach helps to estimate CNN architectures that are less complex and still produce good results. Our experiments confirm that architectures estimated through the proposed approach are as effective as the complex ones defined by empirical and computationally intensive strategies.

Introduction

The vast amount of data currently available has fostered the development of methodologies capable of processing and extracting meaningful features to assist the interpretation, understanding, and solution of complex problems. In this context, the area of Deep Learning (DL) has emerged as a main alternative to analyze massive data, presenting breakthrough results in tasks such as speech recognition (Graves, Mohamed, & Hinton, 2013), machine translation (Luong, Sutskever, Le, Vinyals, & Zaremba, 2014), and data classification (Lauer, Suen, Bloch, 2007, LeCun, Bottou, Bengio, Haffner, 1998, Sharif Razavian, Azizpour, Sullivan, Carlsson, 2014, Zhou, Lapedriza, Xiao, Torralba, Oliva, 2014).

Deep Learning algorithms operate in multiple levels, each of which composed of a set of regression models that involve linear and nonlinear components. The combination of multiple models makes possible the representation of complex functions (LeCun, Bengio, & Hinton, 2015). Most DL algorithms resemble Artificial Neural Networks, such as the Multilayer Perceptron (Haykin & Network, 2004), where input vectors are processed throughout consecutive layers containing operation units to emphasize or inhibit features (LeCun, Bottou, Bengio, Haffner, 1998, Sharif Razavian, Azizpour, Sullivan, Carlsson, 2014).

A particularly important DL algorithm is the so-called Convolutional Neural Network (CNN), which has gained prestige mainly due to its good performance in computer vision tasks (Oquab, Bottou, Laptev, Sivic, 2015, Scherer, Müller, Behnke, 2010a, Scherer, Schulz, Behnke, 2010b, Sharif Razavian, Azizpour, Sullivan, Carlsson, 2014, Zhou, Lapedriza, Xiao, Torralba, Oliva, 2014), and its feature extraction ability from time-dependent data such as audio and video (Karpathy, Toderici, Shetty, Leung, Sukthankar, Fei-Fei, 2014, Osadchy, Cun, Miller, 2007, Schluter, Bock, 2014). However, the performance of CNN strongly depends on its architecture, including the number of layers, units per layer, and convolutional mask sizes.

Moreover, an architecture that works well for a given problem may not be appropriate when dealing with different data types or tasks. A usual alternative to remedy this issue is to evaluate different CNN architectures, choosing the one with the best performance. Such a procedure clearly bears a number of drawbacks: (i) the training and evaluation of a single architecture is already computationally intensive, thus the overall assessment of several of them may be infeasible in many scenarios; and (ii) the empirically defined architectures may not be appropriate for the problem under consideration (Lappas & Chen, 2009; Menotti, Chiachia, Falcao, & Oliveira Neto, 2014), thus acceptable results may never be reached.

A strategy commonly employed to avoid the assessment of multiple CNN settings considers an additional training stage to tune the weights associated with convolutional units. This also increases the computational burden and requires thousands or millions of examples to produce reasonable results (Lauer, Suen, & Bloch, 2007; LeCun, Bottou, Bengio, & Haffner, 1998; Simard, Steinkraus, & Platt, 2003). Another strategy is the ensemble of CNNs, which usually allows a reduction in the maximum number of iterations at the cost of training more architectures to obtain relevant results. It is also time consuming, though, hindering its application in production environments (Ciregan, Meier, & Schmidhuber, 2012).

In this work, we present a novel methodology to assist the definition of CNN architectures that differs substantially from the alternatives described above. Specifically, our approach analyzes the input and output images produced by the convolutional operations at each CNN layer in order to estimate adequate dimensions for the convolutional masks and to suitably set the number of convolutional units per layer. In addition, motivated by the Occam’s razor problem-solving principle (Blumer, Ehrenfeucht, Haussler, & Warmuth, 1987), we aim to design architectures that are as simple as possible, yet still efficient, to address target problems.

The faster convergence of simpler CNN architectures is demonstrated here using the Statistical Learning Theory, more specifically as a result of the Chernoff bound and the Hoeffding inequality (Devroye, Györfi, & Lugosi, 2013; Vapnik, 2013; Von Luxburg & Schölkopf, 2011). Eq. (1) presents the condition that ensures learning bounds, in which $\mathcal{N}(F, 2n)$ corresponds to the Shattering coefficient, $F$ is the set of all possible functions provided by an algorithm (a.k.a. the algorithm bias), and $n$ is the sample size (here, two samples with $n$ elements each are considered). The Shattering coefficient is indeed a function of $n$ which defines the maximum number of admissible functions contained in $F$ that produce distinct classifications, considering the worst possible sample organization with size $n$ (Smola & Schölkopf, 1998).

$$\frac{\log \mathcal{N}(F, 2n)}{n} \to 0 \quad \text{(1)}$$

As already discussed in Von Luxburg and Schölkopf (2011), the Shattering coefficient can be approximated by $an^2$ when considering just one neuron of some deep network (see Eq. (2)), for some constant $a > 0$. Thus, Eq. (3) provides the Shattering coefficient for a deep architecture composed of $k$ units in total.

$$\frac{\log \mathcal{N}(F, 2n)}{n} = \frac{\log(an^2)}{n} \quad \text{(2)}$$

$$\frac{\log \mathcal{N}(F, 2n)}{n} = \frac{\log\left((an^2)^k\right)}{n} = \frac{k \log(an^2)}{n} \quad \text{(3)}$$

The sample size required to ensure convergence is defined in Eq. (4), in which $R(f)$ represents the real risk (the expected value of the loss function, in range [0, 1]), $R_{emp}(f)$ corresponds to the empirical risk (the average loss computed on a given sample, in range [0, 1]), $0 < \epsilon < 1$ is a threshold that indicates an acceptable divergence limit between risks, $F$ is the set of all functions provided by some supervised learning algorithm, and $n$ is the sample size. Thus, the convergence of some algorithm for a given network architecture is analyzed here in terms of its number of units and the sample size. Eq. (5) defines the generalization probability, in which $P(\sup_{f \in F} |R(f) - R_{emp}(f)| > \epsilon)$ represents the probability that an algorithm does not generalize. So, the main goal of the Statistical Learning Theory, as proved by Vapnik (2013), is to ensure the term $2(an^2)^k e^{-n\epsilon^2/4}$ converges to zero as a greater sample size is provided, thereby guaranteeing generalization.

$$P\left(\sup_{f \in F} |R(f) - R_{emp}(f)| > \epsilon\right) \leq 2\,\mathcal{N}(F, 2n)\,e^{-n\epsilon^2/4} \quad \text{(4)}$$

$$P\left(\sup_{f \in F} |R(f) - R_{emp}(f)| > \epsilon\right) \leq 2\,(an^2)^k\,e^{-n\epsilon^2/4} \quad \text{(5)}$$

From Eq. (5), it is possible to conclude that additional CNN units directly require greater sample sizes in order to ensure the right-hand side of the inequality approaches zero. Fig. 1 illustrates the generalization probability produced according to Eq. (5), in which the sample size $n$ varies from 1 to 1 million, the number of network units $k$ is set to {10, 50, 100, 250}, $a = 10^{-10}$, and $\epsilon = 0.05$ (a 5% divergence is accepted between the expected and the empirical risks). Convergence begins when a curve starts decaying, but it only occurs once a sufficiently large sample is provided, so that the bound tends to zero.
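The behavior shown in Fig. 1 can be checked numerically. The sketch below evaluates the logarithm of the right-hand side of Eq. (5) at n = 1 million for the same unit counts k; the constants a = 10⁻¹⁰ and ϵ = 0.05 follow the figure's setup, while the function and variable names are ours:

```python
import math

def log_bound(n, k, a=1e-10, eps=0.05):
    """Natural log of the right-hand side of Eq. (5):
    2 (a n^2)^k exp(-n eps^2 / 4), evaluated in log-space
    so that large exponents do not overflow."""
    return math.log(2) + k * math.log(a * n * n) - n * eps ** 2 / 4

# After 1 million examples, the bound has collapsed for k up to 100,
# but it is still astronomically large for k = 250.
for k in (10, 50, 100, 250):
    lb = log_bound(1_000_000, k)
    status = "converges" if lb < 0 else "does NOT converge"
    print(f"k={k:3d}: log of bound = {lb:8.1f} -> {status}")
```

A negative log-bound means the probability of poor generalization is already negligible; for k = 250, the Shattering term (an²)ᵏ still dominates the exponential decay even at n = 10⁶, matching the discussion of Fig. 1.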

As shown in Fig. 1, the Shattering coefficient is chiefly responsible for holding back the convergence, making it require more data examples. The greater the complexity of the Shattering coefficient, the longer the bound behaves as an exponentially growing function; only after a large enough sample size is it dominated by the other Chernoff terms, finally decreasing and converging. In addition, the theoretical convergence could not be illustrated for any network with 250 or more neurons: (i) first of all, the term $a$ would have to be smaller to make the Chernoff bound approach zero; (ii) secondly, the bound tended to infinity as more data examples were provided; and, finally, (iii) not even with 1 million examples could it converge in theory. We believe this theoretical demonstration is necessary and sufficient to confirm the need for designing simpler architectures (fewer neurons and layers).

Based on this theoretical formulation after Vapnik (2013), it is clear that large CNNs require greater samples to guarantee the learning convergence and, ultimately, to prove that a good classifier was obtained. Thus, the goal of this paper is to estimate adequate parameters in order to design simpler and yet efficient CNN architectures, producing networks that converge faster with a reasonable number of training examples. In summary, we propose a method to estimate the adequate dimensions of the convolutional masks (convolutional kernels) and the number of convolutional units (CNN neurons) at each layer of a CNN architecture for general-purpose classification tasks.

Motivated by the False Nearest Neighbors (FNN) method (Kennel, Brown, & Abarbanel, 1992), a well-known tool from the area of Dynamical Systems, our method analyzes the input and output images produced by the convolutional operation of each CNN layer in order to estimate the adequate dimensions for the convolutional masks and the suitable number of convolutional units per layer. In more detail, this analysis takes each image and builds up vectors to embed data into high-dimensional spaces in an attempt to increase the recurrence levels. Recurrences are here associated with the prevalent and most similar patterns occurring in a given input image.
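For background, the original time-series form of FNN can be sketched as follows: a neighbor is declared false when adding one more embedding coordinate pushes it far away from the reference state. This is an illustrative sketch only, not the paper's image adaptation (described later); the distance-ratio threshold `rtol` and the toy sine series are our choices:

```python
import numpy as np

def false_nearest_fraction(series, m, d=1, rtol=10.0):
    """Fraction of false nearest neighbors when embedding `series`
    with dimension m and time delay d, following Kennel et al. (1992):
    a neighbor is false when the (m+1)-th coordinate makes the distance
    grow by more than `rtol` times the distance in dimension m."""
    n = len(series) - m * d            # each state needs one extra coordinate
    states = np.array([series[i:i + m * d:d] for i in range(n)])
    false = 0
    for i in range(n):
        dist = np.linalg.norm(states - states[i], axis=1)
        dist[i] = np.inf               # exclude the state itself
        j = int(np.argmin(dist))       # nearest neighbor in dimension m
        extra = abs(series[i + m * d] - series[j + m * d])
        if extra > rtol * dist[j]:
            false += 1                 # the neighbor was a projection artifact
    return false / n

# Toy example: a clean sine wave is unfolded by a small embedding dimension,
# so the fraction of false neighbors drops sharply after m = 1.
t = np.linspace(0, 8 * np.pi, 400)
fractions = [false_nearest_fraction(np.sin(t), m) for m in (1, 2, 3)]
print([round(f, 3) for f in fractions])
```

The embedding dimension at which this fraction approaches zero is the one that adequately unfolds the data, which is the quantity the adapted method estimates for CNN masks and units.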

The CNN architectures produced by our approach were compared against more complex ones, using the Caffe deep learning framework (Jia et al., 2014). Four datasets were used: i) CMU Face Images (Roweis & Saul, 2000), which contains images of human faces; ii) MNIST, a dataset of handwritten digits; iii) the Columbia University Image Library, referred to as COIL-100 (Nene, Nayar, & Murase, 1996), which contains object images divided into 100 classes; and iv) the German Traffic Sign Recognition Benchmark, referred to as GTSRB (Stallkamp, Schlipsing, Salmen, & Igel, 2012), a dataset of traffic sign images.

Experimental results confirm that our CNN architectures have error rates similar to the ones listed in the literature, while presenting significantly lower complexity (fewer layers and units per layer). Therefore, our methodology turns out to be a viable alternative to design simpler CNN architectures that are faster to train and less prone to overfitting (Cogswell, Ahmed, Girshick, Zitnick, & Batra, 2015; Lappas & Chen, 2009).

This paper is organized as follows: Section 2 discusses Deep Learning approaches and their results, with particular attention to the datasets used in our experiments; Section 3 uses Linear Algebra to formulate CNN operations; Section 4 addresses the original False Nearest Neighbors (FNN) method, proposed by Kennel et al. (1992); Section 5 describes all FNN modifications necessary to make the estimation of CNN architectures possible; Section 6 shows experimental results on the four selected datasets, evaluating the mask sizes and the number of convolutional units for a CNN with one, two, and three convolutional layers; Section 7 presents the concluding remarks and perspectives for future work.

Section snippets

Related work

The good performance of DL algorithms in classification problems has motivated their use in several domains, in particular to tackle handwritten digit recognition, where the MNIST dataset 1 is considered one of the main benchmarks. Among the best results reported for it, LeCun, Haffner, Bottou, and Bengio (1999) achieved 0.95% of error rate using a CNN with the LeNet-5 architecture, reducing to 0.8% after

Convolutional neural network

The Convolutional Neural Network (CNN) is a Deep Learning algorithm designed to process multidimensional data, such as signals, images, and videos (LeCun et al., 2015), and to extract relevant features even in the presence of noise, shifting, rescaling, and other types of data distortions (Goodfellow, Bengio, & Courville, 2016; LeCun & Bengio, 1995; LeCun, Bottou, Bengio, & Haffner, 1998). CNN is a multilayer network, and each layer is composed of units responsible for different types of operations,

False nearest neighbors

Takens (1981) proposed a methodology to unfold time-dependent data into multidimensional spaces, also referred to as phase spaces, which makes the identification of recurrences easier, thus simplifying tasks such as modeling and forecasting. The method embeds a time series $X = \{x_0, \ldots, x_n\}$ in the phase space by producing states in the form $x_i(m, d) = (x_i, x_{i+d}, \ldots, x_{i+(m-1)d})$, where the parameter $m$ defines the dimension of the phase space (also called the embedding dimension) and $d$ is the time delay. Although very
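The state construction just defined can be sketched in a few lines (the function name is ours; this is an illustration of the embedding, not the paper's implementation):

```python
import numpy as np

def takens_embed(x, m, d):
    """Build phase-space states x_i(m, d) = (x_i, x_{i+d}, ..., x_{i+(m-1)d})."""
    n = len(x) - (m - 1) * d          # number of complete states
    return np.array([x[i:i + (m - 1) * d + 1:d] for i in range(n)])

# Toy series x_0..x_9 embedded with dimension m = 3 and delay d = 2.
states = takens_embed(np.arange(10), m=3, d=2)
print(states.shape)   # (6, 3); the first state is (x_0, x_2, x_4)
```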

Adapting the false nearest neighbors method

We noticed that the vector representation $v_{i,j} = [I_{i-x_1,\,j-y_1}, \ldots, I_{i,j}, \ldots, I_{i+x_m,\,j+y_n}]$ used by the CNN on a local region of some image $I$ (Section 3) is obtained after an application of Takens’ immersion theorem considering two embedding dimensions: the embedding dimension along rows ($M$) and the embedding dimension over columns ($N$), which together define the dimensionality of the vectors $v_{i,j}$ (see Section 3). Thus, we adapted the False Nearest Neighbors (FNN) method in order to properly estimate
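Concretely, such region vectors can be collected by sliding an M×N window over the image and flattening each local region — the same windows a convolutional mask visits. A minimal sketch (names are ours; the paper's adaptation additionally evaluates recurrence among these vectors):

```python
import numpy as np

def patch_vectors(image, M, N):
    """One flattened vector per M x N local region of a grayscale image,
    i.e. the windows a convolutional mask of size M x N slides over."""
    rows, cols = image.shape
    out = []
    for i in range(rows - M + 1):
        for j in range(cols - N + 1):
            out.append(image[i:i + M, j:j + N].ravel())
    return np.array(out)

img = np.arange(16, dtype=float).reshape(4, 4)
vecs = patch_vectors(img, M=2, N=2)
print(vecs.shape)   # (9, 4): nine 2x2 regions, each flattened to length 4
```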

Experiments

This section presents the datasets considered; then we show the setup for our approach based on the False Nearest Neighbors method and for the Convolutional Neural Network (CNN). Next, we present the results that our FNN approach produced when evaluating the training examples contained in each dataset, while: i) assessing the mask sizes for convolutional units; and ii) varying the number of convolutional units. Then, the best mask sizes and numbers of units found were used to perform

Conclusions

Studies on Deep Learning have usually considered very complex CNN architectures, containing many layers and convolutional units in an attempt to improve classification tasks. However, most of them lack justification for the parametrization considered. In fact, most of them simply analyze several CNN settings to empirically find an adequate architecture. What they probably miss is that those settings may not be enough to provide simpler and adequate architectures to tackle practical problems,

Acknowledgements

We would like to thank Prof. Moacir Ponti for reviewing this work as well as for his suggestions. This paper is supported by CAPES, Brazil, under grant no. 7901561/D, by CNPq, Brazil, under grants no. 03051/2014-0 and 302643/2013-3, and by FAPESP, Brazil, under grants no. 2011/22749-8, 2012/17961-0 and 2014/13323-5. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of CAPES, CNPq or FAPESP.

References (60)

  • A. Blumer et al.

    Occam’s razor

    Information Processing Letters

    (1987)
  • Y. LeCun

    Learning invariant feature hierarchies

    Computer vision–ECCV 2012. Workshops and demonstrations

    (2012)
  • G. Strang

    Linear algebra and its applications

    (1988)
  • S. Albelwi et al.

    A framework for designing the architectures of deep convolutional neural networks

    Entropy

    (2017)
  • K.T. Alligood et al.

    Chaos in differential equations

    Chaos

    (1997)
  • L. Bottou

    Stochastic gradient learning in neural networks

    Proceedings of Neuro-Nımes

    (1991)
  • D. Ciregan et al.

    Multi-column deep neural networks for image classification

    Computer vision and pattern recognition (CVPR), 2012 IEEE conference on

    (2012)
  • Cogswell, M., Ahmed, F., Girshick, R.B., Zitnick, L., & Batra, D. (2015). Reducing overfitting in deep networks by...
  • H. Daumé III et al.

    Frustratingly easy domain adaptation

    Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing

    (2010)
  • J. Dean et al.

    Large scale distributed deep networks

    Advances in neural information processing systems

    (2012)
  • L. Devroye et al.

    A probabilistic theory of pattern recognition

    (2013)
  • B.A. Garro et al.

    Designing artificial neural networks using particle swarm optimization algorithms

    Computational Intelligence and Neuroscience

    (2015)
  • I.J. Goodfellow et al.

    Maxout networks

    ICML

    (2013)
  • Goodfellow, I., Bengio, Y., & Courville, A., (2016). Deep learning. Book in preparation for MIT Press....
  • A. Graves et al.

    Speech recognition with deep recurrent neural networks

    2013 IEEE international conference on acoustics, speech and signal processing

    (2013)
  • S. Hassairi et al.

    Supervised image classification using deep convolutional wavelets network

    Tools with artificial intelligence (ICTAI), 2015 IEEE 27th international conference on

    (2015)
  • S. Haykin et al.

    A comprehensive foundation

    Neural Networks

    (2004)
  • G.E. Hinton et al.

    A fast learning algorithm for deep belief nets

    Neural Computation

    (2006)
  • F.J. Huang et al.

    Large-scale learning with SVM and convolutional for generic object categorization

    Computer vision and pattern recognition, 2006 IEEE computer society conference on

    (2006)
  • R. Ihaka et al.

    R: a language for data analysis and graphics

    Journal of computational and graphical statistics

    (1996)
  • K. Jarrett et al.

    What is the best multi-stage architecture for object recognition?

    Computer vision, 2009 IEEE 12th international conference on

    (2009)
  • Y. Jia et al.

    Caffe: Convolutional architecture for fast feature embedding

    Proceedings of the ACM international conference on multimedia

    (2014)
  • H. Kantz et al.

    Nonlinear time series analysis

    (2004)
  • A. Karpathy et al.

    Large-scale video classification with convolutional neural networks

    Proceedings of the IEEE conference on computer vision and pattern recognition

    (2014)
  • M.B. Kennel et al.

    Determining embedding dimension for phase-space reconstruction using a geometrical construction

    Physical review A

    (1992)
  • G. Lappas et al.

    Neural networks and multimedia datasets: Estimating the size of neural networks for achieving high classification accuracy

    Wseas international conference. Proceedings. Mathematics and computers in science and engineering

    (2009)
  • F. Lauer et al.

    A trainable feature extractor for handwritten digit recognition

    Pattern Recognition

    (2007)
  • Y. LeCun et al.

    Convolutional networks for images, speech, and time series

    The handbook of brain theory and neural networks

    (1995)
  • Y. LeCun et al.

    Deep learning

    Nature

    (2015)
  • Y. LeCun et al.

    Gradient-based learning applied to document recognition

    Proceedings of the IEEE

    (1998)