Introduction

It is our belief that researchers and practitioners acquire, through experience and word-of-mouth, techniques and heuristics that help them successfully apply neural networks to difficult real-world problems. Often these “tricks” are theoretically well motivated. Sometimes they are the result of trial and error. However, their most common link is that they are usually hidden in people’s heads or in the back pages of space-constrained conference papers. As a result, newcomers to the field waste much time wondering why their networks train so slowly and perform so poorly.

This book is an outgrowth of a 1996 NIPS workshop called Tricks of the Trade, whose goal was to begin the process of gathering and documenting these tricks. The interest that the workshop generated motivated us to expand our collection and compile it into this book. Although we have no doubt that there are many tricks we have missed, we hope that what we have included will prove to be useful, particularly to those who are relatively new to the field. Each chapter contains one or more tricks presented by a given author (or authors). We have attempted to group related chapters into sections, though we recognize that the different sections are far from disjoint. Some of the chapters (e.g. 1, 13, 17) contain entire systems of tricks that are far more general than the category they have been placed in.

Before each section we provide the reader with a summary of the tricks contained within, to serve as a quick overview and reference. However, we do not recommend applying tricks before having read the accompanying chapter. Each trick may only work in a particular context that is not fully explained in the summary. This is particularly true for the chapters that present systems where combinations of tricks must be applied together for them to be effective.

Below we give a coarse roadmap of the contents of the individual chapters.

Klaus-Robert Müller

Speeding Learning

There are those who argue that developing fast algorithms is no longer necessary because computers have become so fast. However, we believe that the complexity of our algorithms and the size of our problems will always expand to consume all cycles available, regardless of the speed of our machines. Thus, there will never come a time when computational efficiency can or should be ignored. Besides, in the quest to find solutions faster, we often find better and more stable solutions as well. This section is devoted to techniques for making the learning process in backpropagation (BP) faster and more efficient. It contains a single chapter based on a workshop by Léon Bottou and Yann LeCun. While many alternative learning systems have emerged since the time BP was first introduced, BP is still the most widely used learning algorithm. The reasons for this are its simplicity, its efficiency, and its general effectiveness on a wide range of problems. Even so, there are many pitfalls in applying it, which is where all these tricks enter.

Klaus-Robert Müller

1. Efficient BackProp

The convergence of back-propagation learning is analyzed so as to explain common phenomena observed by practitioners. Many undesirable behaviors of backprop can be avoided with tricks that are rarely exposed in serious technical publications. This paper gives some of those tricks, and offers explanations of why they work.

Many authors have suggested that second-order optimization methods are advantageous for neural net training. It is shown that most “classical” second-order methods are impractical for large neural networks. A few methods are proposed that do not have these limitations.

Yann A. LeCun, Léon Bottou, Genevieve B. Orr, Klaus-Robert Müller

Regularization Techniques to Improve Generalization

Good tricks for regularization are extremely important for improving the generalization ability of neural networks. The first and most commonly used trick is early stopping, which was originally described in [11]. In its simplest version, the trick is as follows:

Take an independent validation set, e.g. take out a part of the training set, and monitor the error on this set during training. The error on the training set will decrease, whereas the error on the validation set will first decrease and then increase. The early stopping point occurs where the error on the validation set is the lowest. It is here that the network weights provide the best generalization.

Klaus-Robert Müller
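The simplest version of early stopping described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `early_stopping_epoch`, the `patience` parameter, and the validation curve are hypothetical and do not come from the book.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return (best_epoch, best_error) for a sequence of validation errors.

    Scanning stops once the validation error has failed to improve for
    `patience` consecutive epochs. `patience` is an illustrative assumption,
    not a criterion taken from the book.
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            # New lowest validation error: remember this point.
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training here.
            break
    return best_epoch, best_error


# Hypothetical validation curve: falls, rises, then dips again late.
val_curve = [0.90, 0.62, 0.48, 0.41, 0.43, 0.47, 0.52, 0.40]
print(early_stopping_epoch(val_curve, patience=3))
print(early_stopping_epoch(val_curve, patience=10))
```

With a short patience the scan stops before it sees the late dip in the curve; with a long patience it finds the lower error at the cost of more epochs. The choice of stopping criterion therefore matters in practice.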

2. Early Stopping — But When?

Validation can be used to detect when overfitting starts during supervised training of a neural network; training is then stopped before convergence to avoid the overfitting (“early stopping”). The exact criterion used for validation-based early stopping, however, is usually chosen in an ad-hoc fashion or training is stopped interactively. This trick describes how to select a stopping criterion in a systematic fashion; it is a trick for either speeding learning procedures or improving generalization, whichever is more important in the particular situation. An empirical investigation on multi-layer perceptrons shows that there exists a tradeoff between training time and generalization: From the given mix of 1296 training runs using 12 different problems and 24 different network architectures I conclude that slower stopping criteria allow for small improvements in generalization (here: about 4% on average), but cost much more training time (here: about a factor of 4 longer on average).

Lutz Prechelt

3. A Simple Trick for Estimating the Weight Decay Parameter

We present a simple trick to get an approximate estimate of the weight decay parameter λ. The method combines early stopping and weight decay into the estimate

$ \hat\lambda = \parallel \nabla E(W_{es})\parallel /\parallel 2W_{es}\parallel, $

where $W_{es}$ is the set of weights at the early stopping point, and $E(W)$ is the training data fit error.

The estimate is demonstrated and compared to the standard cross-validation procedure for λ selection on one synthetic and four real life data sets. The result is that $\hat\lambda$ is as good an estimator for the optimal weight decay parameter value as the standard search estimate, but orders of magnitude quicker to compute.

The results also show that weight decay can produce solutions that are significantly superior to committees of networks trained with early stopping.

Thorsteinn S. Rögnvaldsson
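The estimate in the abstract above is a ratio of two norms, so it can be computed directly once the gradient of the fit error at the early stopping point is known. One way to read the formula, assuming a penalized cost of the form $E(W) + \lambda \parallel W \parallel^2$ (an assumption made here for illustration), is that a minimum of that cost satisfies $\nabla E(W) = -2\lambda W$. The sketch below uses hypothetical names and made-up numbers.

```python
import math


def estimate_weight_decay(grad_at_es, weights_at_es):
    """Compute lambda_hat = ||grad E(W_es)|| / ||2 W_es||.

    `grad_at_es` is the gradient of the training fit error at the early
    stopping point and `weights_at_es` the weights there, both given as
    flat sequences of floats (an illustrative representation).
    """
    grad_norm = math.sqrt(sum(g * g for g in grad_at_es))
    doubled_weight_norm = math.sqrt(sum((2.0 * w) ** 2 for w in weights_at_es))
    if doubled_weight_norm == 0.0:
        raise ValueError("weights are all zero; the estimate is undefined")
    return grad_norm / doubled_weight_norm


# Made-up numbers: ||grad|| = 5 and ||2W|| = 100.
lambda_hat = estimate_weight_decay([3.0, -4.0], [30.0, 40.0])
print(lambda_hat)
```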

4. Controlling the Hyperparameter Search in MacKay’s Bayesian Neural Network Framework

In order to achieve good generalization with neural networks, overfitting must be controlled. Weight penalty factors are one common method of providing this control. However, using weight penalties creates the additional search problem of finding the optimal penalty factors. MacKay [5] proposed an approximate Bayesian framework for training neural networks, in which penalty factors are treated as hyperparameters and found in an iterative search. However, for classification networks trained with cross-entropy error, this search is slow and unstable, and it is not obvious how to improve it. This paper describes and compares several strategies for controlling this search. Some of these strategies greatly improve the speed and stability of the search. Test runs on a range of tasks are described.

Tony Plate

5. Adaptive Regularization in Neural Network Modeling

In this paper we address the important problem of optimizing regularization parameters in neural network modeling. The suggested optimization scheme is an extended version of the recently presented algorithm [25]. The idea is to minimize an empirical estimate - like the cross-validation estimate - of the generalization error with respect to regularization parameters. This is done by employing a simple iterative gradient descent scheme using virtually no additional programming overhead compared to standard training. Experiments with feed-forward neural network models for time series prediction and classification tasks showed the viability and robustness of the algorithm. Moreover, we provided some simple theoretical examples in order to illustrate the potential and limitations of the proposed regularization framework.

Jan Larsen, Claus Svarer, Lars Nonboe Andersen, Lars Kai Hansen

6. Large Ensemble Averaging

Averaging over many predictors leads to a reduction of the variance portion of the error. We present a method for evaluating the mean squared error of an infinite ensemble of predictors from finite (small size) ensemble information. We demonstrate it on ensembles of networks with different initial choices of synaptic weights. We find that the optimal stopping criterion for large ensembles occurs later in training time than for single networks. We test our method on the sunspots data set and obtain excellent results.

David Horn, Ury Naftaly, Nathan Intrator
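The first sentence of the abstract above, that averaging reduces the variance part of the error, can be illustrated with a small simulation. This is not the chapter's method for extrapolating to an infinite ensemble. It assumes, for illustration only, that each predictor's error is independent, zero-mean Gaussian noise around the true target; real networks trained on the same data have correlated errors, so the gain in practice is smaller. All names are hypothetical.

```python
import random


def ensemble_vs_single_mse(n_members=25, n_trials=2000, noise_sd=1.0, seed=0):
    """Compare the squared error of one predictor with that of an average.

    Each simulated predictor outputs the true target plus independent
    Gaussian noise (an idealized assumption, see the text above).
    """
    rng = random.Random(seed)
    target = 0.0
    single_sq, ensemble_sq = 0.0, 0.0
    for _ in range(n_trials):
        preds = [target + rng.gauss(0.0, noise_sd) for _ in range(n_members)]
        single_sq += (preds[0] - target) ** 2          # one ensemble member
        mean_pred = sum(preds) / n_members             # ensemble average
        ensemble_sq += (mean_pred - target) ** 2
    return single_sq / n_trials, ensemble_sq / n_trials


single_mse, ensemble_mse = ensemble_vs_single_mse()
print(f"single: {single_mse:.3f}  ensemble of 25: {ensemble_mse:.3f}")
```

Under the independence assumption the variance of the average shrinks roughly in proportion to the number of members.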

Improving Network Models and Algorithmic Tricks

This section contains five chapters presenting easy-to-implement tricks that modify the architecture and/or the learning algorithm so as to enhance the network’s modeling ability. Better modeling means better solutions in less time.

Klaus-Robert Müller

7. Square Unit Augmented, Radially Extended, Multilayer Perceptrons

Consider a multilayer perceptron (MLP) with d inputs, a single hidden sigmoidal layer and a linear output. By adding an additional d inputs to the network with values set to the square of the first d inputs, properties reminiscent of higher-order neural networks and radial basis function networks (RBFN) are added to the architecture with little added expense in terms of weight requirements. Of particular interest, this architecture has the ability to form localized features in a d-dimensional space with a single hidden node but can also span large volumes of the input space; thus, the architecture has the localized properties of an RBFN but does not suffer as badly from the curse of dimensionality. I refer to a network of this type as a SQuare Unit Augmented, Radially Extended, MultiLayer Perceptron (SQUARE-MLP or SMLP).

Gary William Flake
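The input augmentation described in the abstract above is simple to write down. The sketch below appends the squared inputs and evaluates a single sigmoidal hidden unit on the result. The weights are hypothetical, chosen by hand rather than learned, to show how one hidden node can form a localized, radially symmetric feature.

```python
import math


def augment_with_squares(x):
    """Append the square of each input, turning d inputs into 2d inputs."""
    return list(x) + [v * v for v in x]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hidden_unit(x, weights, bias):
    """One sigmoidal hidden unit applied to the augmented input."""
    augmented = augment_with_squares(x)
    return sigmoid(bias + sum(w * a for w, a in zip(weights, augmented)))


# Hypothetical hand-picked weights: zero on the linear inputs, -1 on the
# squared inputs. The pre-activation is then 2 - (x1^2 + x2^2), so the
# unit responds strongly near the origin and fades with distance.
weights = [0.0, 0.0, -1.0, -1.0]
bias = 2.0
for point in ([0.0, 0.0], [1.0, 1.0], [3.0, 3.0]):
    print(point, round(hidden_unit(point, weights, bias), 4))
```

With nonzero weights on the linear inputs the same unit behaves like an ordinary sigmoidal ridge, which is how the augmented inputs let one architecture cover both localized and wide-ranging features.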

8. A Dozen Tricks with Multitask Learning

Multitask Learning is an inductive transfer method that improves generalization accuracy on a main task by using the information contained in the training signals of other tasks. It does this by learning the extra tasks in parallel with the main task while using a shared representation; what is learned for each task can help other tasks be learned better. This chapter describes a dozen opportunities for applying multitask learning in real problems. At the end of the chapter we also make several suggestions for how to get the most out of multitask learning on real-world problems.

Rich Caruana

9. Solving the Ill-Conditioning in Neural Network Learning

In this paper we investigate the feed-forward learning problem. The well-known ill-conditioning which is present in most feed-forward learning problems is shown to be the result of the structure of the network. Also, the well-known problem that weights between ‘higher’ layers in the network have to settle before ‘lower’ weights can converge is addressed. We present a solution to these problems by modifying the structure of the network through the addition of linear connections which carry shared weights. We call the new network structure the linearly augmented feed-forward network, and it is shown that the universal approximation theorems are still valid. Simulation experiments show the validity of the new method, and demonstrate that the new network is less sensitive to local minima and learns faster than the original network.

Patrick van der Smagt, Gerd Hirzinger

10. Centering Neural Network Gradient Factors

It has long been known that neural networks can learn faster when their input and hidden unit activities are centered about zero; recently we have extended this approach to also encompass the centering of error signals [15]. Here we generalize this notion to all factors involved in the network’s gradient, leading us to propose centering the slope of hidden unit activation functions as well. Slope centering removes the linear component of backpropagated error; this improves credit assignment in networks with shortcut connections. Benchmark results show that this can speed up learning significantly without adversely affecting the trained network’s generalization ability.

Nicol N. Schraudolph

11. Avoiding Roundoff Error in Backpropagating Derivatives

One significant source of roundoff error in backpropagation networks is the calculation of derivatives of unit outputs with respect to their total inputs. The roundoff error can result in high relative error in derivatives, and in particular, derivatives being calculated to be zero when in fact they are small but non-zero. This roundoff error is easily avoided with a simple programming trick which has a small memory overhead (one or two extra floating point numbers per unit) and an insignificant computational overhead.

Tony Plate
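The kind of roundoff error described in the abstract above is easy to reproduce for a logistic unit, whose derivative is commonly computed as y(1 - y). The sketch below is one plausible illustration and not necessarily the chapter's exact trick: it assumes a logistic activation and avoids the subtraction by computing 1 - y directly from the total input and keeping it as one extra number per unit. Function names are hypothetical.

```python
import math


def logistic_derivative_naive(x):
    """Derivative of the logistic function computed as y * (1 - y)."""
    y = 1.0 / (1.0 + math.exp(-x))
    return y * (1.0 - y)  # 1 - y rounds to exactly 0 once y rounds to 1


def logistic_derivative_stable(x):
    """Same derivative, with 1 - y computed directly from the input.

    Valid for moderate |x|; math.exp overflows for arguments above
    roughly 709 in double precision.
    """
    y = 1.0 / (1.0 + math.exp(-x))
    one_minus_y = 1.0 / (1.0 + math.exp(x))  # extra value kept per unit
    return y * one_minus_y


x = 40.0
print(logistic_derivative_naive(x))   # zero: the derivative is lost
print(logistic_derivative_stable(x))  # small but non-zero
```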

Representing and Incorporating Prior Knowledge in Neural Network Training

The present section focuses on tricks for four important aspects in learning: (1) incorporation of prior knowledge, (2) choice of representation for the learning task, (3) unequal class prior distributions, and finally (4) large network training.

Klaus-Robert Müller

12. Transformation Invariance in Pattern Recognition – Tangent Distance and Tangent Propagation

In pattern recognition, statistical modeling, or regression, the amount of data is a critical factor affecting the performance. If the amount of data and computational resources are unlimited, even trivial algorithms will converge to the optimal solution. However, in the practical case, given limited data and other resources, satisfactory performance requires sophisticated methods to regularize the problem by introducing a priori knowledge. Invariance of the output with respect to certain transformations of the input is a typical example of such a priori knowledge. In this chapter, we introduce the concept of tangent vectors, which compactly represent the essence of these transformation invariances, and two classes of algorithms, “tangent distance” and “tangent propagation”, which make use of these invariances to improve performance.

Patrice Y. Simard, Yann A. LeCun, John S. Denker, Bernard Victorri

13. Combining Neural Networks and Context-Driven Search for On-line, Printed Handwriting Recognition in the Newton

While on-line handwriting recognition is an area of long-standing and ongoing research, the recent emergence of portable, pen-based computers has focused urgent attention on usable, practical solutions. We discuss a combination and improvement of classical methods to produce robust recognition of hand-printed English text, for a recognizer shipping in new models of Apple Computer’s Newton MessagePad® and eMate®. Combining an artificial neural network (ANN), as a character classifier, with a context-driven search over segmentation and word recognition hypotheses provides an effective recognition system. Long-standing issues relative to training, generalization, segmentation, models of context, probabilistic formalisms, etc., need to be resolved, however, to get excellent performance. We present a number of recent innovations in the application of ANNs as character classifiers for word recognition, including integrated multiple representations, normalized output error, negative training, stroke warping, frequency balancing, error emphasis, and quantized weights. User-adaptation and extension to cursive recognition pose continuing challenges.

Larry S. Yaeger, Brandyn J. Webb, Richard F. Lyon

14. Neural Network Classification and Prior Class Probabilities

A commonly encountered problem in MLP (multi-layer perceptron) classification problems is related to the prior probabilities of the individual classes – if the number of training examples that correspond to each class varies significantly between the classes, then it may be harder for the network to learn the rarer classes in some cases. Such practical experience does not match theoretical results which show that MLPs approximate Bayesian a posteriori probabilities (independent of the prior class probabilities). Our investigation of the problem shows that the difference between the theoretical and practical results lies with the assumptions made in the theory (accurate estimation of Bayesian a posteriori probabilities requires the network to be large enough, training to converge to a global minimum, infinite training data, and the a priori class probabilities of the test set to be correctly represented in the training set). Specifically, the problem can often be traced to the fact that efficient MLP training mechanisms lead to sub-optimal solutions for most practical problems. In this chapter, we demonstrate the problem, discuss possible methods for alleviating it, and introduce new heuristics which are shown to perform well on a sample ECG classification problem. The heuristics may also be used as a simple means of adjusting for unequal misclassification costs.

Steve Lawrence, Ian Burns, Andrew Back, Ah Chung Tsoi, C. Lee Giles

15. Applying Divide and Conquer to Large Scale Pattern Recognition Tasks

Rather than presenting a specific trick, this paper aims at providing a methodology for large scale, real-world classification tasks involving thousands of classes and millions of training patterns. Such problems arise in speech recognition, handwriting recognition and speaker or writer identification, just to name a few. Given the typically very large number of classes to be distinguished, many approaches focus on parametric methods to independently estimate class conditional likelihoods. In contrast, we demonstrate how the principles of modularity and hierarchy can be applied to directly estimate posterior class probabilities in a connectionist framework. Apart from offering better discrimination capability, we argue that a hierarchical classification scheme is crucial in tackling the above mentioned problems. Furthermore, we discuss training issues that have to be addressed when an almost infinite amount of training data is available.

Jürgen Fritsch, Michael Finke

Tricks for Time Series

In the last section we focus on tricks related to time series analysis and economic forecasting. In chapter 16, John Moody opens with a survey of the challenges of macroeconomic forecasting including problems such as noise, nonstationarities, nonlinearities, and the lack of good a priori models. Lest one be discouraged, descriptions of many possible neural network solutions are next presented including hyperparameter selection (e.g. for regularization, training window length), input variable selection, model selection (size and topology of network), better regularizers, committee forecasts, and model visualization.

Klaus-Robert Müller

16. Forecasting the Economy with Neural Nets: A Survey of Challenges and Solutions

Macroeconomic forecasting is a very difficult task due to the lack of an accurate, convincing model of the economy. The most accurate models for economic forecasting, “black box” time series models, assume little about the structure of the economy. Constructing reliable time series models is challenging due to short data series, high noise levels, nonstationarities, and nonlinear effects. This chapter describes these challenges and presents some neural network solutions to them. Important issues include balancing the bias/variance tradeoff and the noise/nonstationarity tradeoff. A brief survey of methods includes hyperparameter selection (regularization parameter and training window length), input variable selection and pruning, network architecture selection and pruning, new smoothing regularizers, committee forecasts and model visualization. Separate sections present more in-depth descriptions of smoothing regularizers, architecture selection via the generalized prediction error (GPE) and nonlinear cross-validation (NCV), input selection via sensitivity based pruning (SBP), and model interpretation and visualization. Throughout, empirical results are presented for forecasting the U.S. Index of Industrial Production. These demonstrate that, relative to conventional linear time series and regression methods, superior performance can be obtained using state-of-the-art neural network models.

John Moody

17. How to Train Neural Networks

The purpose of this paper is to give guidance in neural network modeling. Starting with the preprocessing of the data, we discuss different types of network architecture and show how these can be combined effectively. We analyze several cost functions to avoid unstable learning due to outliers and heteroscedasticity. The Observer - Observation Dilemma is solved by forcing the network to construct smooth approximation functions. Furthermore, we propose some pruning algorithms to optimize the network architecture. All these features and techniques are linked up to a complete and consistent training procedure (see figure 17.25 for an overview), such that the synergy of the methods is maximized.

Ralph Neuneier, Hans Georg Zimmermann

Big Learning in Deep Neural Networks

Big Learning and Deep Neural Networks

More data and compute resources open the way to “big learning”, that is, scaling up machine learning to large data sets and complex problems. In order to solve these new problems, we need to identify the complex dependencies that interrelate inputs and outputs [1]. Achieving this goal requires powerful algorithms in terms of both representational power and ability to absorb and distill information from large data flows. More and more it becomes apparent that neural networks provide an excellent toolset to scale learning. This holds true for simple linear systems as well as today’s grand challenges where the underlying application problem requires high nonlinearity and complex structured representations.

Grégoire Montavon, Klaus-Robert Müller

18. Stochastic Gradient Descent Tricks

Chapter 1 strongly advocates the stochastic back-propagation method to train neural networks. This is in fact an instance of a more general technique called stochastic gradient descent (SGD). This chapter provides background material, explains why SGD is a good learning algorithm when the training set is large, and provides useful recommendations.

Léon Bottou
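As a rough illustration of stochastic gradient descent, which updates the parameters after each individual training example rather than after a pass over the whole set, here is a minimal sketch on a one-dimensional linear model with squared error. The model, data, learning rate, and function name are assumptions chosen for brevity; they are not taken from the chapter, which concerns far larger training sets.

```python
import random


def sgd_linear_fit(xs, ys, learning_rate=0.1, epochs=100, seed=0):
    """Fit y = w * x + b by SGD on the per-example loss 0.5 * error**2."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a fresh random order
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * error**2 with respect to w and b,
            # estimated from this single example only.
            w -= learning_rate * error * xs[i]
            b -= learning_rate * error
    return w, b


# Noise-free toy data generated from y = 2x + 1 on a grid in [-1, 1].
xs = [i / 10.0 - 1.0 for i in range(21)]
ys = [2.0 * x + 1.0 for x in xs]
w, b = sgd_linear_fit(xs, ys)
print(f"w = {w:.4f}, b = {b:.4f}")
```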

19. Practical Recommendations for Gradient-Based Training of Deep Architectures

Learning algorithms related to artificial neural networks and in particular for Deep Learning may seem to involve many bells and whistles, called hyper-parameters. This chapter is meant as a practical guide with recommendations for some of the most commonly used hyperparameters, in particular in the context of learning algorithms based on back-propagated gradient and gradient-based optimization. It also discusses how to deal with the fact that more interesting results can be obtained when allowing one to adjust many hyper-parameters. Overall, it describes elements of the practice used to successfully and efficiently train and debug large-scale and often deep multi-layer neural networks. It closes with open questions about the training difficulties observed with deeper architectures.

Yoshua Bengio

20. Training Deep and Recurrent Networks with Hessian-Free Optimization

In this chapter we will first describe the basic HF approach, and then examine well-known performance-improving techniques such as preconditioning which we have found to be beneficial for neural network training, as well as others of a more heuristic nature which are harder to justify, but which we have found to work well in practice. We will also provide practical tips for creating efficient and bug-free implementations and discuss various pitfalls which may arise when designing and using an HF-type approach in a particular application.

James Martens, Ilya Sutskever

21. Implementing Neural Networks Efficiently

Neural networks and machine learning algorithms in general require a flexible environment where new algorithm prototypes and experiments can be set up as quickly as possible with best possible computational performance. To that end, we provide a new framework called Torch7, that is especially suited to achieve both of these competing goals. Torch7 is a versatile numeric computing framework and machine learning library that extends a very lightweight and powerful programming language Lua. Its goal is to provide a flexible environment to design, train and deploy learning machines. Flexibility is obtained via Lua, an extremely lightweight scripting language. High performance is obtained via efficient OpenMP/SSE and CUDA implementations of low-level numeric routines. Torch7 can also easily be interfaced to third-party software thanks to Lua’s light C interface.

Ronan Collobert, Koray Kavukcuoglu, Clément Farabet

Better Representations: Invariant, Disentangled and Reusable

In many cases, the amount of labeled data is limited and does not allow for fully identifying the function that needs to be learned. When labeled data is scarce, the learning algorithm is exposed to simultaneous underfitting and overfitting. The learning algorithm starts to “invent” nonexistent regularities (overfitting) while at the same time not being able to model the true ones (underfitting). In the extreme case, this amounts to perfectly memorizing training data and not being able to generalize at all to new data.

Grégoire Montavon, Klaus-Robert Müller

22. Learning Feature Representations with K-Means

Many algorithms are available to learn deep hierarchies of features from unlabeled data, especially images. In many cases, these algorithms involve multi-layered networks of features (e.g., neural networks) that are sometimes tricky to train and tune and are difficult to scale up to many machines effectively. Recently, it has been found that K-means clustering can be used as a fast alternative training method. The main advantage of this approach is that it is very fast and easily implemented at large scale. On the other hand, employing this method in practice is not completely trivial: K-means has several limitations, and care must be taken to combine the right ingredients to get the system to work well. This chapter will summarize recent results and technical tricks that are needed to make effective use of K-means clustering for learning large-scale representations of images. We will also connect these results to other well-known algorithms to make clear when K-means can be most useful and convey intuitions about its behavior that are useful for debugging and engineering new systems.

Adam Coates, Andrew Y. Ng

23. Deep Big Multilayer Perceptrons for Digit Recognition

The competitive MNIST handwritten digit recognition benchmark has a long history of broken records since 1998. The most recent advancement by others dates back 8 years (error rate 0.4%). Good old on-line back-propagation for plain multi-layer perceptrons yields a very low 0.35% error rate on the MNIST handwritten digits benchmark with a single MLP and 0.31% with a committee of seven MLPs. All we need to achieve this best result (as of 2011) are many hidden layers, many neurons per layer, numerous deformed training images to avoid overfitting, and graphics cards to greatly speed up learning.

Dan Claudiu Cireşan, Ueli Meier, Luca Maria Gambardella, Jürgen Schmidhuber

24. A Practical Guide to Training Restricted Boltzmann Machines

Restricted Boltzmann machines (RBMs) have been used as generative models of many different types of data. RBMs are usually trained using the contrastive divergence learning procedure. This requires a certain amount of practical experience to decide how to set the values of numerical meta-parameters. Over the last few years, the machine learning group at the University of Toronto has acquired considerable expertise at training RBMs and this guide is an attempt to share this expertise with other machine learning researchers.

Geoffrey E. Hinton

25. Deep Boltzmann Machines and the Centering Trick

Deep Boltzmann machines are in theory capable of learning efficient representations of seemingly complex data. Designing an algorithm that effectively learns the data representation can be subject to multiple difficulties. In this chapter, we present the “centering trick” that consists of rewriting the energy of the system as a function of centered states. The centering trick improves the conditioning of the underlying optimization problem and makes learning more stable, leading to models with better generative and discriminative properties.

Grégoire Montavon, Klaus-Robert Müller

26. Deep Learning via Semi-supervised Embedding

We show how nonlinear semi-supervised embedding algorithms popular for use with “shallow” learning techniques such as kernel methods can be easily applied to deep multi-layer architectures, either as a regularizer at the output layer, or on each layer of the architecture. Compared to standard supervised backpropagation this can give significant gains. This trick provides a simple alternative to existing approaches to semi-supervised deep learning whilst yielding competitive error rates compared to those methods, and existing shallow semi-supervised techniques.

Jason Weston, Frédéric Ratle, Hossein Mobahi, Ronan Collobert

Identifying Dynamical Systems for Forecasting and Control

Identifying dynamical systems from data is a promising approach to data forecasting and optimal control. Data forecasting is an essential component of rational decision making in quantitative finance, marketing and planning. Optimal control systems, that is, systems that can sense the environment and react appropriately, enable the design of cost efficient gas turbines, smart grids and human-machine interfaces.

Grégoire Montavon, Klaus-Robert Müller

27. A Practical Guide to Applying Echo State Networks

Reservoir computing has emerged in the last decade as an alternative to gradient descent methods for training recurrent neural networks. Echo State Network (ESN) is one of the key reservoir computing “flavors”. While being practical, conceptually simple, and easy to implement, ESNs require some experience and insight to achieve the hailed good performance in many tasks. Here we present practical techniques and recommendations for successfully applying ESNs, as well as some more advanced application-specific modifications.

Mantas Lukoševičius

28. Forecasting with Recurrent Neural Networks: 12 Tricks

Recurrent neural networks (RNNs) are typically considered as relatively simple architectures, which come along with complicated learning algorithms. This paper has a different view: We start from the fact that RNNs can model any high dimensional, nonlinear dynamical system. Rather than focusing on learning algorithms, we concentrate on the design of network architectures. Unfolding in time is a well-known example of this modeling philosophy. Here a temporal algorithm is transferred into an architectural framework such that the learning can be performed by an extension of standard error backpropagation.

We introduce 12 tricks that not only provide deeper insights into the functioning of RNNs but also improve the identification of underlying dynamical systems from data.

Hans-Georg Zimmermann, Christoph Tietz, Ralph Grothmann

29. Solving Partially Observable Reinforcement Learning Problems with Recurrent Neural Networks

The aim of this chapter is to provide a series of tricks and recipes for neural state estimation, particularly for real-world applications of reinforcement learning. We use various topologies of recurrent neural networks, as they make it possible to identify the continuous-valued, possibly high-dimensional state space of complex dynamical systems. Recurrent neural networks explicitly offer possibilities to account for time and memory; in principle, they are able to model any type of dynamical system. Because of these capabilities, recurrent neural networks are a suitable tool to approximate a Markovian state space of dynamical systems. In a second step, reinforcement learning methods can be applied to solve a defined control problem. Besides the trick of using a recurrent neural network for state estimation, various issues regarding real-world problems, such as large sets of observables and long-term dependencies, are addressed.

Siegmund Duell, Steffen Udluft, Volkmar Sterzing

30. 10 Steps and Some Tricks to Set up Neural Reinforcement Controllers

The paper discusses the steps necessary to set up a neural reinforcement controller for successfully solving typical (real-world) control tasks. The main intention is to provide a code of practice of crucial steps that show how to transform control task requirements into the specification of a reinforcement learning task. We do not claim that the approach we propose is the only one (establishing this would require a lot of empirical work, which is beyond the scope of the paper), but wherever possible we try to explain why we do things one way rather than another. Our procedure for setting up a neural reinforcement learning system has worked well for a large range of real, realistic, or benchmark-style control applications.

Martin Riedmiller

Table of Contents

Frontmatter

Introduction

Speeding Learning

1. Efficient BackProp

Regularization Techniques to Improve Generalization

2. Early Stopping — But When?

3. A Simple Trick for Estimating the Weight Decay Parameter

4. Controlling the Hyperparameter Search in MacKay’s Bayesian Neural Network Framework

5. Adaptive Regularization in Neural Network Modeling

6. Large Ensemble Averaging

Improving Network Models and Algorithmic Tricks

7. Square Unit Augmented, Radially Extended, Multilayer Perceptrons

8. A Dozen Tricks with Multitask Learning

9. Solving the Ill-Conditioning in Neural Network Learning

10. Centering Neural Network Gradient Factors

11. Avoiding Roundoff Error in Backpropagating Derivatives

Representing and Incorporating Prior Knowledge in Neural Network Training

12. Transformation Invariance in Pattern Recognition – Tangent Distance and Tangent Propagation

13. Combining Neural Networks and Context-Driven Search for On-line, Printed Handwriting Recognition in the Newton

14. Neural Network Classification and Prior Class Probabilities

15. Applying Divide and Conquer to Large Scale Pattern Recognition Tasks

Tricks for Time Series

16. Forecasting the Economy with Neural Nets: A Survey of Challenges and Solutions

17. How to Train Neural Networks

Big Learning in Deep Neural Networks

Big Learning and Deep Neural Networks

18. Stochastic Gradient Descent Tricks

19. Practical Recommendations for Gradient-Based Training of Deep Architectures

20. Training Deep and Recurrent Networks with Hessian-Free Optimization

21. Implementing Neural Networks Efficiently

Better Representations: Invariant, Disentangled and Reusable

22. Learning Feature Representations with K-Means

23. Deep Big Multilayer Perceptrons for Digit Recognition

24. A Practical Guide to Training Restricted Boltzmann Machines

25. Deep Boltzmann Machines and the Centering Trick

26. Deep Learning via Semi-supervised Embedding

Identifying Dynamical Systems for Forecasting and Control

27. A Practical Guide to Applying Echo State Networks

28. Forecasting with Recurrent Neural Networks: 12 Tricks

29. Solving Partially Observable Reinforcement Learning Problems with Recurrent Neural Networks

30. 10 Steps and Some Tricks to Set up Neural Reinforcement Controllers

Backmatter
