
2018 | Original Paper | Book Chapter

3. Training Deep Neural Networks

Author: Charu C. Aggarwal

Published in: Neural Networks and Deep Learning

Publisher: Springer International Publishing


Abstract

The procedure for training neural networks with backpropagation was briefly introduced in Chapter 1. This chapter expands on that description in several ways.
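As a concrete reference point for the backpropagation procedure that the chapter builds on, here is a minimal sketch (our illustration, not code from the book; the network size, learning rate, and variable names are arbitrary choices) of one stochastic gradient-descent update for a single-hidden-layer ReLU network with squared loss.

```python
# Minimal sketch: one backpropagation + gradient-descent step for a
# single-hidden-layer ReLU network with squared loss.
import numpy as np

rng = np.random.default_rng(0)
d, hidden_dim, lr = 4, 8, 0.01                               # input size, hidden units, learning rate (illustrative)
W1 = rng.normal(0, 1.0 / np.sqrt(d), (hidden_dim, d))        # fan-in scaled initialization
W2 = rng.normal(0, 1.0 / np.sqrt(hidden_dim), (1, hidden_dim))

x, y = rng.normal(size=(d, 1)), np.array([[1.0]])            # one training pair

# Forward pass: h = relu(W1 x), y_hat = W2 h, loss = 0.5 * (y_hat - y)^2
a1 = W1 @ x
h = np.maximum(a1, 0.0)
y_hat = W2 @ h
loss = 0.5 * float((y_hat - y) ** 2)

# Backward pass: apply the chain rule layer by layer.
delta_out = y_hat - y                          # dL/dy_hat
grad_W2 = delta_out @ h.T                      # dL/dW2
delta_hidden = (W2.T @ delta_out) * (a1 > 0)   # backpropagate through the ReLU
grad_W1 = delta_hidden @ x.T                   # dL/dW1

# Gradient-descent update.
W2 -= lr * grad_W2
W1 -= lr * grad_W1
print(f"loss before update: {loss:.4f}")
```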


Footnotes
1
Although the backpropagation algorithm was popularized by the Rumelhart et al. papers [408, 409], it had been studied earlier in the context of control theory. Crucially, Paul Werbos’s forgotten (and eventually rediscovered) thesis in 1974 discussed how these backpropagation methods could be used in neural networks. This was well before Rumelhart et al.’s papers in 1986, which were nevertheless significant because the style of presentation contributed to a better understanding of why backpropagation might work.
 
2
A different type of manifestation occurs in cases where the parameters in earlier and later layers are shared. In such cases, the effect of an update can be highly unpredictable because of the combined effect of different layers. Such scenarios occur in recurrent neural networks, in which the parameters in later temporal layers are tied to those of earlier temporal layers. In such cases, small changes in the parameters can cause large changes in the loss function in very localized regions without any gradient-based indication in nearby regions. Such topological characteristics of the loss function are referred to as cliffs (cf. Section 3.5.4), and they make the problem harder to optimize because gradient descent tends to either overshoot or undershoot.
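To make the cliff behavior concrete, the short sketch below (our illustration, not code from the book; the linear tied-weight "recurrence" and the clipping threshold are assumed for demonstration) shows how a loss that depends on a tied weight raised to the power of the sequence length changes abruptly near w = 1, and how gradient-norm clipping, as proposed by Pascanu et al. [368], caps the update so that gradient descent does not overshoot the cliff.

```python
# Sketch of the "cliff" effect: with the same weight w applied at every time step,
# the loss depends on w**T, so a tiny change in w near w = 1 produces an enormous
# change in loss and gradient.
import numpy as np

T, target = 50, 0.0                 # number of tied (temporal) layers, illustrative target

def loss_and_grad(w, x=1.0):
    """Loss 0.5 * (w**T * x - target)**2 for a linear 'RNN' with tied weight w."""
    y = (w ** T) * x
    grad = (y - target) * T * (w ** (T - 1)) * x   # d(loss)/dw by the chain rule
    return 0.5 * (y - target) ** 2, grad

for w in (0.95, 1.00, 1.05):
    L, g = loss_and_grad(w)
    print(f"w={w:.2f}  loss={L:.3e}  gradient={g:.3e}")   # loss and gradient jump sharply past w = 1

# A common remedy, gradient-norm clipping (cf. Pascanu et al. [368]), caps the step size.
def clip_gradient(grad, max_norm=1.0):
    norm = abs(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

_, g = loss_and_grad(1.05)
print("clipped gradient:", clip_gradient(g))
```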
 
3
In most of this book, we have worked with \(\overline{W}\) as a row-vector. However, it is notationally convenient here to work with \(\overline{W}\) as a column-vector.
 
References
[7] R. Ahuja, T. Magnanti, and J. Orlin. Network flows: Theory, algorithms, and applications. Prentice Hall, 1993.
[13] J. Ba and R. Caruana. Do deep nets really need to be deep? NIPS Conference, pp. 2654–2662, 2014.
[23] M. Bazaraa, H. Sherali, and C. Shetty. Nonlinear programming: theory and algorithms. John Wiley and Sons, 2013.
[24] S. Becker and Y. LeCun. Improving the convergence of back-propagation learning with second order methods. Proceedings of the 1988 Connectionist Models Summer School, pp. 29–37, 1988.
[36] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kegl. Algorithms for hyper-parameter optimization. NIPS Conference, pp. 2546–2554, 2011.
[37] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13, pp. 281–305, 2012.
[38] J. Bergstra, D. Yamins, and D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. ICML Conference, pp. 115–123, 2013.
[39] D. Bertsekas. Nonlinear programming. Athena Scientific, 1999.
[41] C. M. Bishop. Neural networks for pattern recognition. Oxford University Press, 1995.
[42] C. M. Bishop. Bayesian techniques. Chapter 10 in "Neural Networks for Pattern Recognition," pp. 385–439, 1995.
[54] A. Bryson. A gradient method for optimizing multi-stage allocation processes. Harvard University Symposium on Digital Computers and their Applications, 1961.
[55] C. Bucilu, R. Caruana, and A. Niculescu-Mizil. Model compression. ACM KDD Conference, pp. 535–541, 2006.
[66] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. ICML Conference, pp. 2285–2294, 2015.
[74] A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, and B. Catanzaro. Deep learning with COTS HPC systems. ICML Conference, pp. 1337–1345, 2013.
[88] Y. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS Conference, pp. 2933–2941, 2014.
[91] J. Dean et al. Large scale distributed deep networks. NIPS Conference, 2012.
[94] M. Denil, B. Shakibi, L. Dinh, M. A. Ranzato, and N. de Freitas. Predicting parameters in deep learning. NIPS Conference, pp. 2148–2156, 2013.
[96] G. Desjardins, K. Simonyan, and R. Pascanu. Natural neural networks. NIPS Conference, pp. 2071–2079, 2015.
[108] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, pp. 2121–2159, 2011.
[140] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. AISTATS, pp. 249–256, 2010.
[141] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. AISTATS, 15(106), 2011.
[146] I. Goodfellow, O. Vinyals, and A. Saxe. Qualitatively characterizing neural network optimization problems. arXiv:1412.6544, 2014. [Also appears in International Conference on Learning Representations, 2015.] https://arxiv.org/abs/1412.6544
[148] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. arXiv:1302.4389, 2013.
[167] R. Hahnloser and H. S. Seung. Permitted and forbidden sets in symmetric threshold-linear networks. NIPS Conference, pp. 217–223, 2001.
[168] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. Horowitz, and W. Dally. EIE: Efficient inference engine for compressed neural network. ACM SIGARCH Computer Architecture News, 44(3), pp. 243–254, 2016.
[169] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural networks. NIPS Conference, pp. 1135–1143, 2015.
[171] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. ICML Conference, pp. 1225–1234, 2016.
[183] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.
[184] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[189] M. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49(6), 1952.
[194] G. Hinton. Neural networks for machine learning. Coursera Video, 2012.
[202] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. NIPS Workshop, 2014.
[205] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. A Field Guide to Dynamical Recurrent Neural Networks, IEEE Press, 2001.
[213]
[214] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
[217] R. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks, 1(4), pp. 295–307, 1988.
[221] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? International Conference on Computer Vision (ICCV), 2009.
[237] H. J. Kelley. Gradient theory of optimal flight paths. ARS Journal, 30(10), pp. 947–954, 1960.
[255] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. NIPS Conference, pp. 1097–1105, 2012.
[273] Q. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Ng. On optimization methods for deep learning. ICML Conference, pp. 265–272, 2011.
[278] Y. LeCun, L. Bottou, G. Orr, and K. Muller. Efficient backprop. In G. Orr and K. Muller (eds.), Neural Networks: Tricks of the Trade, Springer, 1998.
[282] Y. LeCun, J. Denker, and S. Solla. Optimal brain damage. NIPS Conference, pp. 598–605, 1990.
[300] D. Luenberger and Y. Ye. Linear and nonlinear programming. Addison-Wesley, 1984.
[306] D. J. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3), pp. 448–472, 1992.
[313] J. Martens. Deep learning via Hessian-free optimization. ICML Conference, pp. 735–742, 2010.
[314] J. Martens and I. Sutskever. Learning recurrent neural networks with Hessian-free optimization. ICML Conference, pp. 1033–1040, 2011.
[316] J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. ICML Conference, 2015.
[324] T. Mikolov. Statistical language models based on neural networks. Ph.D. thesis, Brno University of Technology, 2012.
[330] M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, 1969.
[353] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27, pp. 372–376, 1983.
[359] J. Nocedal and S. Wright. Numerical optimization. Springer, 2006.
[362] G. Orr and K.-R. Müller (editors). Neural Networks: Tricks of the Trade. Springer, 1998.
[368] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. ICML Conference, 28, pp. 1310–1318, 2013.
[369] R. Pascanu, T. Mikolov, and Y. Bengio. Understanding the exploding gradient problem. CoRR, abs/1211.5063, 2012.
[376] E. Polak. Computational methods in optimization: a unified approach. Academic Press, 1971.
[380] B. Polyak and A. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4), pp. 838–855, 1992.
[408] D. Rumelhart, G. Hinton, and R. Williams. Learning representations by back-propagating errors. Nature, 323(6088), pp. 533–536, 1986.
[409] D. Rumelhart, G. Hinton, and R. Williams. Learning internal representations by back-propagating errors. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, pp. 318–362, 1986.
[419] T. Salimans and D. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. NIPS Conference, pp. 901–909, 2016.
[426] A. Saxe, J. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120, 2013.
[429] T. Schaul, S. Zhang, and Y. LeCun. No more pesky learning rates. ICML Conference, pp. 343–351, 2013.
[443] J. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain. Technical Report CMU-CS-94-125, Carnegie Mellon University, 1994.
[458] J. Snoek, H. Larochelle, and R. Adams. Practical Bayesian optimization of machine learning algorithms. NIPS Conference, pp. 2951–2959, 2013.
[478] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. ICML Conference, pp. 1139–1147, 2013.
[490] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. ACM KDD Conference, pp. 847–855, 2013.
[524] P. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. thesis, Harvard University, 1974.
[525] P. Werbos. The roots of backpropagation: from ordered derivatives to neural networks and political forecasting (Vol. 1). John Wiley and Sons, 1994.
[532] S. Wieseler and H. Ney. A convergence analysis of log-linear training. NIPS Conference, pp. 657–665, 2011.
[545] H. Yu and B. Wilamowski. Levenberg–Marquardt training. Industrial Electronics Handbook, 5(12), 1, 2011.
Metadata
Title
Training Deep Neural Networks
Author
Charu C. Aggarwal
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-319-94463-0_3