Abstract
Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small gap between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family or to the regularization techniques used during training.
Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization and occurs even if we replace the true images with completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth-two neural networks already have perfect finite-sample expressivity as soon as the number of parameters exceeds the number of data points, as it usually does in practice.
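The finite-sample expressivity claim can be sketched numerically. The snippet below is a minimal illustration in the spirit of the paper's construction, not its exact proof; the variable names, dimensions, and random seed are our own choices. With n data points, project the inputs to one dimension, and choose the n ReLU biases to interleave the sorted projections: the resulting hidden-activation matrix is lower-triangular with a positive diagonal, hence invertible, so output weights exactly fitting any labels (including random ones) are obtained by solving a linear system.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.normal(size=(n, d))   # n data points in d dimensions
y = rng.normal(size=n)        # arbitrary (e.g., random) real-valued labels

# Project inputs to 1-D; with probability 1 the projections are distinct.
a = rng.normal(size=d)
z = X @ a
order = np.argsort(z)
z_sorted = z[order]

# Interleave biases with the sorted projections so that, in sorted order,
# the ReLU activation matrix is lower-triangular with a positive diagonal.
b = np.concatenate(([z_sorted[0] - 1.0], z_sorted[:-1]))

# Hidden layer of width n: H[i, j] = max(0, z_i - b_j).
H = np.maximum(0.0, z[:, None] - b[None, :])

# The triangular system is invertible, so the output weights exist
# and yield a perfect fit of the n labels.
w = np.linalg.solve(H[order], y[order])
pred = H @ w
```

Here `pred` matches `y` up to floating-point error: a width-n, depth-two ReLU network (n hidden units, so more than n parameters) memorizes any n labels exactly, which is the sense in which over-parameterized networks have perfect finite-sample expressivity.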
We interpret our experimental findings by comparison with traditional models.
We supplement this republication with a new section at the end summarizing recent progress in the field since the original version of this paper.
Index Terms
- Understanding deep learning (still) requires rethinking generalization