Abstract
Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small gap between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family or to the regularization techniques used during training.
Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization and occurs even if we replace the true images with completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth-two neural networks already have perfect finite-sample expressivity as soon as the number of parameters exceeds the number of data points, as it usually does in practice.
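The finite-sample expressivity claim can be sketched numerically. The snippet below is a minimal illustration in the spirit of the paper's construction, not its exact proof; the variable names, dimensions, and random seed are our own choices. With n data points, project the inputs to one dimension, and choose the n ReLU biases to interleave the sorted projections: the resulting hidden-activation matrix is lower-triangular with a positive diagonal, hence invertible, so output weights exactly fitting any labels (including random ones) are obtained by solving a linear system.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.normal(size=(n, d))   # n data points in d dimensions
y = rng.normal(size=n)        # arbitrary (e.g., random) real-valued labels

# Project inputs to 1-D; with probability 1 the projections are distinct.
a = rng.normal(size=d)
z = X @ a
order = np.argsort(z)
z_sorted = z[order]

# Interleave biases with the sorted projections so that, in sorted order,
# the ReLU activation matrix is lower-triangular with a positive diagonal.
b = np.concatenate(([z_sorted[0] - 1.0], z_sorted[:-1]))

# Hidden layer of width n: H[i, j] = max(0, z_i - b_j).
H = np.maximum(0.0, z[:, None] - b[None, :])

# The triangular system is invertible, so the output weights exist
# and yield a perfect fit of the n labels.
w = np.linalg.solve(H[order], y[order])
pred = H @ w
```

Here `pred` matches `y` up to floating-point error: a width-n, depth-two ReLU network (n hidden units, so more than n parameters) memorizes any n labels exactly, which is the sense in which over-parameterized networks have perfect finite-sample expressivity.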
We interpret our experimental findings by comparison with traditional models.
We supplement this republication with a new section at the end summarizing recent progress in the field since the original version of this paper.
Index Terms
- Understanding deep learning (still) requires rethinking generalization