
Understanding deep learning (still) requires rethinking generalization

Published: 22 February 2021

Abstract

Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small gap between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family or to the regularization techniques used during training.

Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization and occurs even if we replace the true images with completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth-two neural networks already have perfect finite-sample expressivity as soon as the number of parameters exceeds the number of data points, as it usually does in practice.
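To make the randomization test concrete, here is a minimal sketch of the random-label experiment, assuming a PyTorch/torchvision environment. The SmallCNN architecture, optimizer settings, and epoch count below are illustrative placeholders rather than the (much larger) networks and configurations used in the paper; the point is only that an over-parameterized network trained with SGD can interpolate labels drawn uniformly at random.

```python
# A minimal sketch of the random-label experiment, assuming PyTorch and
# torchvision are available. Architecture and hyperparameters are illustrative
# placeholders, not the configurations used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

torch.manual_seed(0)

# Load CIFAR-10 and replace every training label with an independent uniform
# random class, destroying any relationship between images and labels.
train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=transforms.ToTensor())
train_set.targets = torch.randint(0, 10, (len(train_set.targets),)).tolist()
loader = DataLoader(train_set, batch_size=128, shuffle=True)

class SmallCNN(nn.Module):
    """A small but over-parameterized CNN (a few million weights for 50k examples)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(128 * 8 * 8, 512)
        self.fc2 = nn.Linear(512, num_classes)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)  # 32x32 -> 16x16
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # 16x16 -> 8x8
        x = F.relu(self.fc1(x.flatten(1)))
        return self.fc2(x)

model = SmallCNN()
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Train until the network interpolates the random labels: training accuracy
# climbs toward 100% even though the labels carry no information.
for epoch in range(100):
    correct, total = 0, 0
    for images, labels in loader:
        opt.zero_grad()
        logits = model(images)
        loss = F.cross_entropy(logits, labels)
        loss.backward()
        opt.step()
        correct += (logits.argmax(dim=1) == labels).sum().item()
        total += labels.size(0)
    print(f"epoch {epoch}: train accuracy on random labels = {correct / total:.3f}")
```

The same harness can be pointed at the noise ablation mentioned above, for example by overwriting train_set.data with uniform random pixels of the same shape before training.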

We interpret our experimental findings by comparison with traditional models.
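The abstract does not spell out which traditional models are meant; the body of the paper appeals, among other things, to over-parameterized linear least squares, where stochastic gradient descent started at zero converges (when it interpolates the data) to the minimum-norm solution. The derivation below sketches that standard fact; the notation X (an n × d data matrix with d > n and linearly independent rows) and y (the label vector) is introduced here only for illustration.

```latex
% Sketch: why SGD on under-determined least squares returns the minimum-norm
% interpolant. Assumes w_0 = 0 and that SGD converges to a solution with Xw = y.
\begin{align*}
  w_{t+1} &= w_t - \eta_t\,(x_{i_t}^\top w_t - y_{i_t})\,x_{i_t}
    && \text{SGD step on } \tfrac{1}{2}(x_{i_t}^\top w - y_{i_t})^2 \\
  w_0 = 0 \;\Rightarrow\; w_t &\in \operatorname{span}\{x_1,\dots,x_n\}
    && \text{iterates stay in the row space of } X \\
  w = X^\top\alpha,\ Xw = y \;\Rightarrow\; \alpha &= (XX^\top)^{-1} y
    && \text{interpolation within the row space} \\
  \hat{w}_{\mathrm{SGD}} &= X^\top (XX^\top)^{-1} y
    \;=\; \operatorname*{arg\,min}_{Xw = y} \|w\|_2
    && \text{the minimum-}\ell_2\text{-norm interpolant.}
\end{align*}
```

Since Xw = y is solvable for any label vector when d > n and the rows are independent, such minimum-norm interpolants can also fit arbitrary labels, which is one sense in which the randomization results are not unique to deep networks.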

We supplement this republication with a new section at the end summarizing recent progress in the field since the original version of this paper.



Published in

Communications of the ACM, Volume 64, Issue 3 (March 2021), 115 pages
ISSN: 0001-0782   EISSN: 1557-7317   DOI: 10.1145/3452024

Copyright © 2021 Owner/Author. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Publisher: Association for Computing Machinery, New York, NY, United States

