Abstract
We propose a general-purpose stochastic optimization algorithm, the annealing stochastic approximation Monte Carlo (ASAMC) algorithm, for neural network training. ASAMC can be regarded as a space-annealing version of the stochastic approximation Monte Carlo (SAMC) algorithm. Under mild conditions, we show that ASAMC converges weakly at a rate of \(\Omega(1/\sqrt{t})\) toward a neighboring set (in the space of energy) of the global minimizers. ASAMC is compared with simulated annealing, SAMC, and the BFGS algorithm for training MLPs on a number of examples. The numerical results indicate that ASAMC outperforms the other algorithms in both training and test errors. Like other stochastic algorithms, ASAMC requires longer training time than the gradient-based algorithms do. It provides, however, an efficient approach to training MLPs for which the energy landscape is rugged.
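The abstract sketches the two ingredients of the method: SAMC-style subregion log-weights adapted by a stochastic-approximation gain sequence, plus a sample space that is progressively shrunk ("annealed") toward the lowest-energy subregions found so far. The Python sketch below illustrates that flavor of algorithm on a toy energy function; the function name, the parameters (number of subregions, gain `t0/max(t0, t)`, shift `delta_shift`, temperature `tau`), and the acceptance rule are illustrative assumptions, not the paper's exact specification.

```python
import math
import random

def asamc_minimize(energy, x0, n_iter=20000, n_bins=40, e_lo=0.0, e_hi=4.0,
                   step=0.5, delta_shift=1.0, tau=1.0, t0=1000.0, seed=1):
    """Minimal sketch of an annealing SAMC-style minimizer.

    The energy axis [e_lo, e_hi] is partitioned into n_bins subregions whose
    log-weights theta are adapted by stochastic approximation, while the
    sample space shrinks toward the lowest-energy subregions found so far.
    """
    rng = random.Random(seed)
    width = (e_hi - e_lo) / n_bins

    def bin_of(u):
        # Clamp energies outside [e_lo, e_hi] into the edge subregions.
        return min(n_bins - 1, max(0, int((u - e_lo) / width)))

    theta = [0.0] * n_bins            # adaptive log-weights, one per subregion
    x = list(x0)
    u = energy(x)
    best_x, best_u = list(x), u
    max_bin = n_bins - 1              # annealing: highest subregion still allowed

    for t in range(1, n_iter + 1):
        gamma = t0 / max(t0, t)       # decreasing stochastic-approximation gain
        # Gaussian random-walk proposal.
        y = [xi + rng.gauss(0.0, step) for xi in x]
        v = energy(y)
        jb_x, jb_y = bin_of(u), bin_of(v)
        # Reject proposals that fall outside the currently allowed region.
        if jb_y <= max_bin:
            # SAMC-style acceptance: Boltzmann factor re-weighted by theta.
            log_r = (u - v) / tau + theta[jb_x] - theta[jb_y]
            if log_r >= 0 or rng.random() < math.exp(log_r):
                x, u, jb_x = y, v, jb_y
        # Update the weight of the subregion the chain currently occupies.
        theta[jb_x] += gamma
        # Shrink the sample space toward the best energy found so far.
        if u < best_u:
            best_x, best_u = list(x), u
            max_bin = min(max_bin, bin_of(best_u + delta_shift))
    return best_x, best_u
```

For instance, `asamc_minimize(lambda z: z[0]**2 + z[1]**2, [2.0, -2.0])` drives the best energy found close to zero. The growing weight of the occupied subregion is what lets the chain climb out of local minima, while the shrinking `max_bin` keeps it from wandering back to high-energy regions.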
References
Abunawass, A. M., & Owen, C. B. (1993). A statistical analysis of the effect of noise injection during neural network training. SPIE Proceedings, 1966, 362–371.
Amato, S., Apolloni, B., Caporali, G., Madesani, U., & Zanaboni, A. (1991). Simulated annealing approach in back-propagation. Neurocomputing, 3, 207–220.
Andrieu, C., Moulines, E., & Priouret, P. (2005). Stability of stochastic approximation under verifiable conditions. SIAM Journal on Control and Optimization, 44, 283–312.
Barron, A. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39, 930–945.
Baum, E. B., & Lang, K. J. (1991). Constructing hidden units using examples and queries. In Advances in neural information processing systems (Vol. 3, pp. 904–910). San Mateo: Kaufmann.
Benveniste, A., Métivier, M., & Priouret, P. (1990). Adaptive algorithms and stochastic approximations. New York: Springer.
Bharath, B., & Borkar, V. S. (1999). Stochastic approximation algorithms: overview and recent trends. Sadhana, 24, 425–452.
Billingsley, P. (1986). Probability and measure (2nd ed.). New York: Wiley.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.
Broyden, C. G. (1970a). The convergence of a class of double-rank minimization algorithms, part I. Journal of the Institute of Mathematics and Applications, 6, 76–90.
Broyden, C. G. (1970b). The convergence of a class of double-rank minimization algorithms, part II. Journal of the Institute of Mathematics and Applications, 6, 222–231.
Chawla, D., Li, L., & Scott, S. (2004). On approximating weighted sums with exponentially many terms. Journal of Computer and System Sciences, 69, 196–234.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 3, 303–314.
Davidon, W. C. (1959). Variable metric method for minimization. AEC Res. and Dev. Report ANL-5990.
de Freitas, N., Niranjan, M., Gee, A. H., & Doucet, A. (2000). Sequential Monte Carlo methods to train neural network models. Neural Computation, 12, 955–993.
Delyon, B., Lavielle, M., & Moulines, E. (1999). Convergence of a stochastic approximation version of the EM algorithm. Annals of Statistics, 27, 94–128.
Erland, S. (2003). Adaptive Markov chain Monte Carlo review. Technical Report, Department of Mathematical Science, Norwegian University of Science and Technology.
Fahlman, S. E., & Lebiere, C. (1990). The cascade-correlation learning architecture. In D. S. Touretzky (Ed.), Advances in neural information processing systems (Vol. 2, pp. 524–532). San Mateo: Kaufmann.
Flanagan, J. A. (1997). Analyzing a self-organizing algorithm. Neural Networks, 10, 875–883.
Fletcher, R. (1970). A new approach to variable metric algorithms. The Computer Journal, 13, 317–322.
Fletcher, R. (1987). Practical methods of optimization (2nd ed.). New York: Wiley.
Fletcher, R., & Powell, M. J. D. (1963). A rapidly convergent descent method for minimization. The Computer Journal, 6, 163–168.
Funahashi, K. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2, 183–192.
Gelfand, A. E., & Banerjee, S. (1998). Computing marginal posterior modes using stochastic approximation. Technical report, Department of Statistics, University of Connecticut.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
Goldberg, D. E. (1989). Genetic algorithms in search, optimization, and machine learning. Reading: Addison–Wesley.
Goldfarb, D. (1970). A family of variable metric methods derived by variational means. Mathematics of Computation, 24, 23–26.
Goswami, G. R., & Liu, J. S. (2005). On real parameter evolutionary Monte Carlo algorithm. Technical Report, Harvard University.
Gu, M. G., & Kong, F. H. (1998). A stochastic approximation algorithm with Markov chain Monte Carlo method for incomplete data estimation problems. Proceedings of the National Academy of Sciences USA, 95, 7270–7274.
Gu, M. G., & Zhu, H. T. (2001). Maximum likelihood estimation for spatial models by Markov chain Monte Carlo stochastic approximation. Journal of the Royal Statistical Society, Series B, 63, 339–355.
Hanson, S. J. (1991). Behavioral diversity, search and stochastic connectionist systems. In M. Commons, S. Grossberg, & J. Staddon (Eds.), Neural network models of conditioning and action (pp. 295–345). Mahwah: Erlbaum.
Harth, E., & Tzanakou, E. (1974). Alopex: a stochastic method for determining visual receptive fields. Vision Research, 14, 1475–1482.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. New York: Springer.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109.
Haykin, S. (1999). Neural networks: a comprehensive foundation (2nd ed.). New York: Prentice Hall.
Holland, J. H. (1975). Adaptation in natural and artificial systems. Ann Arbor: University of Michigan Press.
Holley, R. A., Kusuoka, S., & Stroock, D. (1989). Asymptotics of the spectral gap with applications to the theory of simulated annealing. Journal of Functional Analysis, 83, 333–347.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.
Ingman, D., & Merlis, Y. (1991). Local minimization escape using thermodynamic properties of neural networks. Neural Networks, 4, 395–404.
Jasra, A., Stephens, D. A., & Holmes, C. C. (2007, to appear). On population-based simulation for static inference. Statistics and Computing.
Jerrum, M., & Sinclair, A. (1997). The Markov chain Monte Carlo method: an approach to approximate counting and integration. In D. S. Hochbaum (Ed.), Approximation algorithms for NP-hard problems (pp. 482–520). Boston: PWS Publishing Company.
Karlin, S., & Taylor, H. M. (1998). An introduction to stochastic modeling (3rd ed.). Orlando: Academic Press.
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220, 671–680.
Kohonen, T. (1990). The self-organizing map. Proceedings of the Institute of Electrical and Electronics Engineers, 78, 1464–1480.
Lang, K. J., & Witbrock, M. J. (1989). Learning to tell two spirals apart. In D. Touretzky, G. Hinton, & T. Sejnowski (Eds.), Proceedings of the 1988 connectionist models (pp. 52–59). San Mateo: Kaufmann.
Levenberg, K. (1944). A method for the solution of certain non-linear problems in least squares. Quarterly Journal of Applied Mathematics, 2, 164–168.
Liang, F. (2003). An effective Bayesian neural network classifier with a comparison study to support vector machine. Neural Computation, 15, 1959–1989.
Liang, F. (2005a). Generalized Wang-Landau algorithm for Monte Carlo computation. Journal of the American Statistical Association, 100, 1311–1327.
Liang, F. (2005b). Evidence evaluation for Bayesian neural networks using contour Monte Carlo. Neural Computation, 17, 1385–1410.
Liang, F., & Wong, W. H. (2001). Real parameter evolutionary Monte Carlo with applications in Bayesian mixture models. Journal of the American Statistical Association, 96, 653–666.
Liang, F., Liu, C., & Carroll, R. J. (2007). Stochastic approximation in Monte Carlo computation. Journal of the American Statistical Association, 102, 305–320.
MacKay, D. J. C. (1992a). A practical Bayesian framework for backpropagation networks. Neural Computation, 4, 448–472.
MacKay, D. J. C. (1992b). The evidence framework applied to classification problems. Neural Computation, 4, 720–736.
Marquardt, D. W. (1963). An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society of Industrial and Applied Mathematics, 11, 431–441.
Mengersen, K. L., & Tweedie, R. L. (1996). Rates of convergence of the Hastings and Metropolis algorithms. The Annals of Statistics, 24, 101–121.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–1091.
Meyn, S. P., & Tweedie, R. L. (1993). Markov chains and stochastic stability. London: Springer.
Mulier, F. M., & Cherkassky, V. S. (1995). Statistical analysis of self-organization. Neural Networks, 8, 717–727.
Müller, P., & Insua, D. R. (1998). Issues in Bayesian analysis of neural network models. Neural Computation, 10, 749–770.
Neal, R. M. (1996). Bayesian learning for neural networks. New York: Springer.
Owen, C. B., & Abunawass, A. M. (1993). Applications of simulated annealing to the back-propagation model improves convergence. SPIE Proceedings, 1966, 269–276.
Perrone, M. P. (1993). Improving regression estimation: averaging methods for variance reduction with extensions to general convex measure optimization. PhD thesis, Brown University, Rhode Island.
Polak, E. (1971). Computational methods in optimization: a unified approach. New York: Academic Press.
Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22, 400–407.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by back-propagating errors. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: explorations in the microstructure of cognition (Vol. 1, pp. 318–362). Cambridge: MIT Press.
Sastry, P. S., Magesh, M., & Unnikrishnan, K. P. (2002). Two timescale analysis of the Alopex algorithm for optimization. Neural Computation, 14, 2729–2750.
Scheffé, H. (1947). A useful convergence theorem for probability distributions. Annals of Mathematical Statistics, 18, 434–438.
Shanno, D. F. (1970). Conditioning of quasi-Newton methods for function minimization. Mathematics of Computation, 24, 647–656.
Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C., & Johannes, R. S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In R. A. Greenes (Ed.), Proceedings of the symposium on computer applications in medical care (pp. 261–265). Los Alamitos: IEEE Computer Society Press.
Spall, J. C. (1992). Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37, 332–341.
Szu, H. (1986). Fast simulated annealing. In AIP conference proceedings: Vol. 151. Neural networks for computing, Snowbird, UT.
Tang, Z., Wang, X., Tamura, H., & Ishii, M. (2003). An algorithm of supervised learning for multilayer neural networks. Neural Computation, 15, 1125–1142.
Tesauro, G., & Janssens, B. (1988). Scaling relations in back-propagation learning. Complex Systems, 2, 39–44.
van Rooij, A. J. F., Jain, L. C., & Johnson, R. P. (1996). Neural network training using genetic algorithms. Singapore: World Scientific.
von Lehmen, A., Paek, E. G., Liao, P. F., Marrakchi, A., & Patel, J. S. (1988). Factors influencing learning by back-propagation. In Proceedings of IEEE international conference on neural networks (pp. 335–341). New York: IEEE Press.
Wang, F., & Landau, D. P. (2001). Efficient, multiple-range random walk algorithm to calculate the density of states. Physical Review Letters, 86, 2050–2053.
Wang, C., & Principe, J. C. (1999). Training neural networks with additive noise in the desired signal. IEEE Transactions on Neural Networks, 10, 1511–1517.
Wong, W. H., & Liang, F. (1997). Dynamic weighting in Monte Carlo and optimization. Proceedings of the National Academy of Sciences USA, 94, 14220–14224.
Wouwer, A. V., Renotte, C., & Remy, M. (1999). On the use of simultaneous perturbation stochastic approximation for neural network training. In Proceedings of the American control conference (pp. 388–392), San Diego, CA.
Additional information
Action Editor: Risto Miikkulainen.
Liang, F. Annealing stochastic approximation Monte Carlo algorithm for neural network training. Mach Learn 68, 201–233 (2007). https://doi.org/10.1007/s10994-007-5017-7