Abstract
We propose a general-purpose stochastic optimization algorithm, the annealing stochastic approximation Monte Carlo (ASAMC) algorithm, for neural network training. ASAMC can be regarded as a space-annealing version of the stochastic approximation Monte Carlo (SAMC) algorithm. Under mild conditions, we show that ASAMC converges weakly at a rate of \(\Omega(1/\sqrt{t})\) toward a neighboring set (in the space of energy) of the global minimizers. ASAMC is compared with simulated annealing, SAMC, and the BFGS algorithm for training MLPs on a number of examples. The numerical results indicate that ASAMC outperforms the other algorithms in both training and test errors. Like other stochastic algorithms, ASAMC requires longer training time than the gradient-based algorithms do. It provides, however, an efficient approach to training MLPs for which the energy landscape is rugged.
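The abstract sketches the two ingredients of the method: SAMC-style subregion log-weights adapted by a stochastic-approximation gain sequence, plus a sample space that is progressively shrunk ("annealed") toward the lowest-energy subregions found so far. The Python sketch below illustrates that flavor of algorithm on a toy energy function; the function name, the parameters (number of subregions, gain `t0/max(t0, t)`, shift `delta_shift`, temperature `tau`), and the acceptance rule are illustrative assumptions, not the paper's exact specification.

```python
import math
import random

def asamc_minimize(energy, x0, n_iter=20000, n_bins=40, e_lo=0.0, e_hi=4.0,
                   step=0.5, delta_shift=1.0, tau=1.0, t0=1000.0, seed=1):
    """Minimal sketch of an annealing SAMC-style minimizer.

    The energy axis [e_lo, e_hi] is partitioned into n_bins subregions whose
    log-weights theta are adapted by stochastic approximation, while the
    sample space shrinks toward the lowest-energy subregions found so far.
    """
    rng = random.Random(seed)
    width = (e_hi - e_lo) / n_bins

    def bin_of(u):
        # Clamp energies outside [e_lo, e_hi] into the edge subregions.
        return min(n_bins - 1, max(0, int((u - e_lo) / width)))

    theta = [0.0] * n_bins            # adaptive log-weights, one per subregion
    x = list(x0)
    u = energy(x)
    best_x, best_u = list(x), u
    max_bin = n_bins - 1              # annealing: highest subregion still allowed

    for t in range(1, n_iter + 1):
        gamma = t0 / max(t0, t)       # decreasing stochastic-approximation gain
        # Gaussian random-walk proposal.
        y = [xi + rng.gauss(0.0, step) for xi in x]
        v = energy(y)
        jb_x, jb_y = bin_of(u), bin_of(v)
        # Reject proposals that fall outside the currently allowed region.
        if jb_y <= max_bin:
            # SAMC-style acceptance: Boltzmann factor re-weighted by theta.
            log_r = (u - v) / tau + theta[jb_x] - theta[jb_y]
            if log_r >= 0 or rng.random() < math.exp(log_r):
                x, u, jb_x = y, v, jb_y
        # Update the weight of the subregion the chain currently occupies.
        theta[jb_x] += gamma
        # Shrink the sample space toward the best energy found so far.
        if u < best_u:
            best_x, best_u = list(x), u
            max_bin = min(max_bin, bin_of(best_u + delta_shift))
    return best_x, best_u
```

For instance, `asamc_minimize(lambda z: z[0]**2 + z[1]**2, [2.0, -2.0])` drives the best energy found close to zero. The growing weight of the occupied subregion is what lets the chain climb out of local minima, while the shrinking `max_bin` keeps it from wandering back to high-energy regions.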
References
Abunawass, A. M., & Owen, C. B. (1993). A statistical analysis of the effect of noise injection during neural network training. SPIE Proceedings, 1966, 362–371.
Amato, S., Apolloni, B., Caporali, G., Madesani, U., & Zanaboni, A. (1991). Simulated annealing approach in back-propagation. Neurocomputing, 3, 207–220.
Andrieu, C., Moulines, E., & Priouret, P. (2005). Stability of stochastic approximation under verifiable conditions. SIAM Journal on Control and Optimization, 44, 283–312.
Barron, A. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39, 930–945.
Baum, E. B., & Lang, K. J. (1991). Constructing hidden units using examples and queries. In Advances in neural information processing systems (Vol. 3, pp. 904–910). San Mateo: Kaufmann.
Benveniste, A., Métivier, M., & Priouret, P. (1990). Adaptive algorithms and stochastic approximations. New York: Springer.
Bharath, B., & Borkar, V. S. (1999). Stochastic approximation algorithms: overview and recent trends. Sadhana, 24, 425–452.
Billingsley, P. (1986). Probability and measure (2nd ed.). New York: Wiley.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.
Broyden, C. G. (1970a). The convergence of a class of double-rank minimization algorithms, part I. Journal of the Institute of Mathematics and Applications, 6, 76–90.
Broyden, C. G. (1970b). The convergence of a class of double-rank minimization algorithms, part II. Journal of the Institute of Mathematics and Applications, 6, 222–231.
Chawla, D., Li, L., & Scott, S. (2004). On approximating weighted sums with exponentially many terms. Journal of Computer and System Sciences, 69, 196–234.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 3, 303–314.
Davidon, W. C. (1959). Variable metric method for minimization. AEC Res. and Dev. Report ANL-5990.
de Freitas, N., Niranjan, M., Gee, A. H., & Doucet, A. (2000). Sequential Monte Carlo methods to train neural network models. Neural Computation, 12, 955–993.
Delyon, B., Lavielle, M., & Moulines, E. (1999). Convergence of a stochastic approximation version of the EM algorithm. Annals of Statistics, 27, 94–128.
Erland, S. (2003). Adaptive Markov chain Monte Carlo review. Technical Report, Department of Mathematical Science, Norwegian University of Science and Technology.
Fahlman, S. E., & Lebiere, C. (1990). The cascade-correlation learning architecture. In D. S. Touretzky (Ed.), Advances in neural information processing systems (Vol. 2, pp. 524–532). San Mateo: Kaufmann.
Flanagan, J. A. (1997). Analyzing a self-organizing algorithm. Neural Networks, 10, 875–883.
Fletcher, R. (1970). A new approach to variable metric algorithms. The Computer Journal, 13, 317–322.
Fletcher, R. (1987). Practical methods of optimization (2nd ed.). New York: Wiley.
Fletcher, R., & Powell, M. J. D. (1963). A rapidly convergent descent method for minimization. The Computer Journal, 6, 163–168.
Funahashi, K. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2, 183–192.
Gelfand, A. E., & Banerjee, S. (1998). Computing marginal posterior modes using stochastic approximation. Technical report, Department of Statistics, University of Connecticut.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
Goldberg, D. E. (1989). Genetic algorithms in search, optimization, and machine learning. Reading: Addison–Wesley.
Goldfarb, D. (1970). A family of variable metric methods derived by variational means. Mathematics of Computation, 24, 23–26.
Goswami, G. R., & Liu, J. S. (2005). On real parameter evolutionary Monte Carlo algorithm. Technical Report, Harvard University.
Gu, M. G., & Kong, F. H. (1998). A stochastic approximation algorithm with Markov chain Monte Carlo method for incomplete data estimation problems. Proceedings of the National Academy of Sciences USA, 95, 7270–7274.
Gu, M. G., & Zhu, H. T. (2001). Maximum likelihood estimation for spatial models by Markov chain Monte Carlo stochastic approximation. Journal of the Royal Statistical Society, Series B, 63, 339–355.
Hanson, S. J. (1991). Behavioral diversity, search and stochastic connectionist systems. In M. Commons, S. Grossberg, & J. Staddon (Eds.), Neural network models of conditioning and action (pp. 295–345). Mahwah: Erlbaum.
Harth, E., & Tzanakou, E. (1974). Alopex: a stochastic method for determining visual receptive fields. Vision Research, 14, 1475–1482.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. New York: Springer.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109.
Haykin, S. (1999). Neural networks: a comprehensive foundation (2nd ed.). New York: Prentice Hall.
Holland, J. H. (1975). Adaptation in natural and artificial systems. Ann Arbor: University of Michigan Press.
Holley, R. A., Kusuoka, S., & Stroock, D. (1989). Asymptotics of the spectral gap with applications to the theory of simulated annealing. Journal of Functional Analysis, 83, 333–347.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.
Ingman, D., & Merlis, Y. (1991). Local minimization escape using thermodynamic properties of neural networks. Neural Networks, 4, 395–404.
Jasra, A., Stephens, D. A., & Holmes, C. C. (2007, to appear). On population-based simulation for static inference. Statistics and Computing.
Jerrum, M., & Sinclair, A. (1997). The Markov chain Monte Carlo method: an approach to approximate counting and integration. In D. S. Hochbaum (Ed.), Approximation algorithms for NP-hard problems (pp. 482–520). Boston: PWS Publishing Company.
Karlin, S., & Taylor, H. M. (1998). An introduction to stochastic modeling (3rd ed.). Orlando: Academic Press.
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220, 671–680.
Kohonen, T. (1990). The self-organizing map. Proceedings of the Institute of Electrical and Electronics Engineers, 78, 1464–1480.
Lang, K. J., & Witbrock, M. J. (1989). Learning to tell two spirals apart. In D. Touretzky, G. Hinton, & T. Sejnowski (Eds.), Proceedings of the 1988 connectionist models (pp. 52–59). San Mateo: Kaufmann.
Levenberg, K. (1944). A method for the solution of certain non-linear problems in least squares. Quarterly Journal of Applied Mathematics, 2, 164–168.
Liang, F. (2003). An effective Bayesian neural network classifier with a comparison study to support vector machine. Neural Computation, 15, 1959–1989.
Liang, F. (2005a). Generalized Wang-Landau algorithm for Monte Carlo computation. Journal of the American Statistical Association, 100, 1311–1327.
Liang, F. (2005b). Evidence evaluation for Bayesian neural networks using contour Monte Carlo. Neural Computation, 17, 1385–1410.
Liang, F., & Wong, W. H. (2001). Real parameter evolutionary Monte Carlo with applications in Bayesian mixture models. Journal of the American Statistical Association, 96, 653–666.
Liang, F., Liu, C., & Carroll, R. J. (2007). Stochastic approximation in Monte Carlo computation. Journal of the American Statistical Association, 102, 305–320.
MacKay, D. J. C. (1992a). A practical Bayesian framework for backpropagation networks. Neural Computation, 4, 448–472.
MacKay, D. J. C. (1992b). The evidence framework applied to classification problems. Neural Computation, 4, 720–736.
Marquardt, D. W. (1963). An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society of Industrial and Applied Mathematics, 11, 431–441.
Mengersen, K. L., & Tweedie, R. L. (1996). Rates of convergence of the Hastings and Metropolis algorithms. The Annals of Statistics, 24, 101–121.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–1091.
Meyn, S. P., & Tweedie, R. L. (1993). Markov chains and stochastic stability. London: Springer.
Mulier, F. M., & Cherkassky, V. S. (1995). Statistical analysis of self-organization. Neural Networks, 8, 717–727.
Müller, P., & Insua, D. R. (1998). Issues in Bayesian analysis of neural network models. Neural Computation, 10, 749–770.
Neal, R. M. (1996). Bayesian learning for neural networks. New York: Springer.
Owen, C. B., & Abunawass, A. M. (1993). Applications of simulated annealing to the back-propagation model improves convergence. SPIE Proceedings, 1966, 269–276.
Perrone, M. P. (1993). Improving regression estimation: averaging methods for variance reduction with extensions to general convex measure optimization. PhD thesis, Brown University, Rhode Island.
Polak, E. (1971). Computational methods in optimization: a unified approach. New York: Academic Press.
Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22, 400–407.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by back-propagating errors. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: explorations in the microstructure of cognition (Vol. 1, pp. 318–362). Cambridge: MIT Press.
Sastry, P. S., Magesh, M., & Unnikrishnan, K. P. (2002). Two timescale analysis of the Alopex algorithm for optimization. Neural Computation, 14, 2729–2750.
Scheffé, H. (1947). A useful convergence theorem for probability distributions. Annals of Mathematical Statistics, 18, 434–438.
Shanno, D. F. (1970). Conditioning of quasi-Newton methods for function minimization. Mathematics of Computation, 24, 647–656.
Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C., & Johannes, R. S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In R. A. Greenes (Ed.), Proceedings of the symposium on computer applications in medical care (pp. 261–265). Los Alamitos: IEEE Computer Society Press.
Spall, J. C. (1992). Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37, 332–341.
Szu, H. (1986). Fast simulated annealing. In AIP conference proceedings: Vol. 151. Neural networks for computing, Snowbird, UT.
Tang, Z., Wang, X., Tamura, H., & Ishii, M. (2003). An algorithm of supervised learning for multilayer neural networks. Neural Computation, 15, 1125–1142.
Tesauro, G., & Janssens, B. (1988). Scaling relations in back-propagation learning. Complex Systems, 2, 39–44.
van Rooij, A. J. F., Jain, L. C., & Johnson, R. P. (1996). Neural network training using genetic algorithms. Singapore: World Scientific.
von Lehmen, A., Paek, E. G., Liao, P. F., Marrakchi, A., & Patel, J. S. (1988). Factors influencing learning by back-propagation. In Proceedings of IEEE international conference on neural networks (pp. 335–341). New York: IEEE Press.
Wang, F., & Landau, D. P. (2001). Efficient, multiple-range random walk algorithm to calculate the density of states. Physical Review Letters, 86, 2050–2053.
Wang, C., & Principe, J. C. (1999). Training neural networks with additive noise in the desired signal. IEEE Transactions on Neural Networks, 10, 1511–1517.
Wong, W. H., & Liang, F. (1997). Dynamic weighting in Monte Carlo and optimization. Proceedings of the National Academy of Sciences USA, 94, 14220–14224.
Wouwer, A. V., Renotte, C., & Remy, M. (1999). On the use of simultaneous perturbation stochastic approximation for neural network training. In Proceedings of the American control conference (pp. 388–392), San Diego, CA.
Additional information
Action Editor: Risto Miikkulainen.
Liang, F. Annealing stochastic approximation Monte Carlo algorithm for neural network training. Mach Learn 68, 201–233 (2007). https://doi.org/10.1007/s10994-007-5017-7