Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm

  • Full Length Paper
  • Series A

Mathematical Programming

Abstract

We obtain an improved finite-sample guarantee on the linear convergence of stochastic gradient descent for smooth and strongly convex objectives, improving from a quadratic dependence on the conditioning \((L/\mu )^2\) (where \(L\) is a bound on the smoothness and \(\mu \) on the strong convexity) to a linear dependence on \(L/\mu \). Furthermore, we show how reweighting the sampling distribution (i.e., importance sampling) is necessary in order to further improve convergence, and obtain a linear dependence on the average smoothness, dominating previous results. We also discuss importance sampling for SGD more broadly and show how it can also improve convergence in other scenarios. Our results are based on a connection we make between SGD and the randomized Kaczmarz algorithm, which allows us to transfer ideas between the separate bodies of literature studying each of the two methods. In particular, we recast the randomized Kaczmarz algorithm as an instance of SGD, and apply our results to prove its exponential convergence, but to the solution of a weighted least squares problem rather than the original least squares problem. We then present a modified Kaczmarz algorithm with partially biased sampling which does converge to the original least squares solution with the same exponential convergence rate.
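To make the algorithms discussed above concrete, the following is a minimal sketch (ours, not the paper's) of a randomized Kaczmarz iteration for \(A\varvec{x}=\varvec{b}\) with row-norm sampling and a uniform mixing parameter intended to mirror the partially biased sampling mentioned in the abstract; the function name randomized_kaczmarz, the parameter beta, and the test problem are illustrative assumptions.

```python
import numpy as np

def randomized_kaczmarz(A, b, iters=1000, beta=0.0, x0=None, seed=None):
    """Randomized Kaczmarz with row-norm sampling, optionally mixed with a
    uniform component (beta = 0 gives pure row-norm sampling)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float).copy()
    row_norms_sq = np.einsum('ij,ij->i', A, A)      # ||a_i||_2^2 for each row
    probs = (1 - beta) * row_norms_sq / row_norms_sq.sum() + beta / m
    for _ in range(iters):
        i = rng.choice(m, p=probs)
        a_i = A[i]
        # Project the current iterate onto the hyperplane {x : <a_i, x> = b_i}.
        x += (b[i] - a_i @ x) / row_norms_sq[i] * a_i
    return x

# Consistent overdetermined system: the iterates converge to the solution of Ax = b.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 20))
x_true = rng.standard_normal(20)
b = A @ x_true
x_hat = randomized_kaczmarz(A, b, iters=5000, seed=1)
print(np.linalg.norm(x_hat - x_true))               # should be near zero
```

With beta = 0 this is the row-norm (Strohmer–Vershynin) sampling; beta > 0 mixes in a uniform component, the flavor of partial bias that, per the abstract, restores convergence to the original least squares solution for inconsistent systems.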


Notes

  1. Bach and Moulines’s results are somewhat more general. Their Lipschitz requirement is a bit weaker and more complicated, but in terms of \(L_i\) it yields (2.3). They also study the use of polynomially decaying step-sizes, but these do not lead to improved runtime if the target accuracy is known ahead of time.

References

  1. Bach, F., Moulines, E.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Advances in Neural Information Processing Systems (NIPS) (2011)

  2. Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010, pp. 177–186. Springer (2010)

  3. Bottou, L., Bousquet, O.: The tradeoffs of large-scale learning. In: Optimization for Machine Learning, p. 351 (2011)

  4. Bousquet, O., Bottou, L.: The tradeoffs of large scale learning. In: Advances in Neural Information Processing Systems, pp. 161–168 (2007)

  5. Boutsidis, C., Mahoney, M., Drineas, P.: An improved approximation algorithm for the column subset selection problem. In: Proceedings of the Symposium on Discrete Algorithms, pp. 968–977 (2009)

  6. Byrne, C.L.: Applied Iterative Methods. A K Peters Ltd., Wellesley, MA. ISBN:978-1-56881-342-4; 1-56881-271-X (2008)

  7. Cenker, C., Feichtinger, H.G., Mayer, M., Steier, H., Strohmer, T.: New variants of the POCS method using affine subspaces of finite codimension, with applications to irregular sampling. In: Proceedings of SPIE: Visual Communications and Image Processing, pp. 299–310 (1992)

  8. Censor, Y., Eggermont, P.P.B., Gordon, D.: Strong underrelaxation in Kaczmarz’s method for inconsistent systems. Numer. Math. 41(1), 83–92 (1983)

  9. Chen, Y., Bhojanapalli, S., Sanghavi, S., Ward, R.: Coherent matrix completion. In: Proceedings of the 31st International Conference on Machine Learning, pp. 674–682 (2014)

  10. Eggermont, P.P.B., Herman, G.T., Lent, A.: Iterative algorithms for large partitioned linear systems, with applications to image reconstruction. Linear Algebra Appl. 40, 37–67 (1981). doi:10.1016/0024-3795(81)90139-7. ISSN:0024–3795

  11. Eldar, Y.C., Needell, D.: Acceleration of randomized Kaczmarz method via the Johnson–Lindenstrauss lemma. Numer. Algorithms 58(2), 163–177 (2011). doi:10.1007/s11075-011-9451-z. ISSN:1017–1398

  12. Elfving, T.: Block-iterative methods for consistent and inconsistent linear equations. Numer. Math. 35(1), 1–12 (1980). doi:10.1007/BF01396365. ISSN:0029–599X

  13. Foygel, R., Srebro, N.: Concentration-based guarantees for low-rank matrix reconstruction. In: 24th Annual Conference on Learning Theory (COLT) (2011)

  14. Gittens, A., Mahoney, M.: Revisiting the Nyström method for improved large-scale machine learning. In: International Conference on Machine Learning (ICML) (2013)

  15. Hanke, M., Niethammer, W.: On the acceleration of Kaczmarz’s method for inconsistent linear systems. Linear Algebra Appl. 130, 83–98 (1990)

  16. Herman, G.T.: Fundamentals of computerized tomography: image reconstruction from projections. Springer, Berlin (2009)

  17. Herman, G.T., Meyer, L.B.: Algebraic reconstruction techniques can be made computationally efficient. IEEE Trans. Med. Imaging 12(3), 600–609 (1993)

  18. Hounsfield, G.N.: Computerized transverse axial scanning (tomography): Part 1. Description of system. Br. J. Radiol. 46(552), 1016–1022 (1973)

  19. Kaczmarz, S.: Angenäherte auflösung von systemen linearer gleichungen. Bull. Int. Acad. Polon. Sci. Lett. Ser. A. 35, 335–357 (1937)

  20. Krahmer, F., Ward, R.: Stable and robust sampling strategies for compressive imaging. IEEE Trans. Image Proc. 23(2), 612–622 (2014)

  21. Lai, M., Yin, W.: Augmented \(\ell _1\) and nuclear-norm models with a globally linearly convergent algorithm. SIAM J. Imaging Sci. 6(2), 1059–1091 (2013)

  22. Lee, Y.T., Sidford, A.: Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In: IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pp. 147–156 (2013)

  23. Liu, J., Wright, S.J., Sridhar, S.: An asynchronous parallel randomized Kaczmarz algorithm. arXiv:1401.4780 (2014)

  24. Ma, P., Yu, B., Mahoney, M.: A statistical perspective on algorithmic leveraging. In: International Conference on Machine Learning (ICML). arXiv:1306.5362 (2014)

  25. Mahoney, M.: Randomized algorithms for matrices and data. Found. Trends Mach. Learn. 3(2), 123–224 (2011)

  26. Murata, N.: A Statistical Study of On-Line Learning. Cambridge University Press, Cambridge (1998)

  27. Natterer, F.: The Mathematics of Computerized Tomography, volume 32 of Classics in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA (2001). ISBN:0-89871-493-1. doi:10.1137/1.9780898719284. Reprint of the 1986 original

  28. Needell, D.: Randomized Kaczmarz solver for noisy linear systems. BIT 50(2), 395–403 (2010). doi:10.1007/s10543-010-0265-5. ISSN:0006–3835

  29. Needell, D., Tropp, J.A.: Paved with good intentions: analysis of a randomized block Kaczmarz method. Linear Algebra Appl. 441, 199–221 (2014)

  30. Needell, D., Ward, R.: Two-subspace projection method for coherent overdetermined linear systems. J. Fourier Anal. Appl. 19(2), 256–269 (2013)

  31. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)

  32. Nesterov, Y.: Introductory Lectures on Convex Optimization. Kluwer, Boston (2004)

  33. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)

  34. Popa, C.: Extensions of block-projections methods with relaxation parameters to inconsistent and rank-deficient least-squares problems. BIT 38(1), 151–176 (1998). doi:10.1007/BF02510922. ISSN:0006–3835

  35. Popa, C.: Block-projections algorithms with blocks containing mutually orthogonal rows and columns. BIT 39(2), 323–338 (1999). doi:10.1023/A:1022398014630. ISSN:0006–3835

  36. Popa, C.: A fast Kaczmarz–Kovarik algorithm for consistent least-squares problems. Korean J. Comput. Appl. Math. 8(1), 9–26 (2001). ISSN:1229–9502

  37. Popa, C.: A Kaczmarz–Kovarik algorithm for symmetric ill-conditioned matrices. An. Ştiinţ. Univ. Ovidius Constanţa Ser. Mat. 12(2), 135–146 (2004). ISSN:1223–723X

  38. Popa, C., Preclik, T., Köstler, H., Rüde, U.: On Kaczmarz’s projection iteration as a direct solver for linear least squares problems. Linear Algebra Appl. 436(2), 389–404 (2012)

  39. Rauhut, H., Ward, R.: Sparse Legendre expansions via \(\ell _1\)-minimization. J. Approx. Theory 164(5), 517–533 (2012)

  40. Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 144(1–2), 1–38 (2012)

  41. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)

  42. Shalev-Shwartz, S., Srebro, N.: SVM optimization: inverse dependence on training set size. In: Proceedings of the 25th International Conference on Machine Learning, pp. 928–935 (2008)

  43. Shamir, O., Zhang, T.: Stochastic gradient descent for non-smooth optimization: convergence results and optimal averaging schemes. arXiv:1212.1824 (2012)

  44. Spielman, D., Srivastava, N.: Graph sparsification by effective resistances. SIAM J. Comput. 40(6), 1913–1926 (2011)

  45. Srebro, N., Sridharan, K., Tewari, A.: Smoothness, low noise and fast rates. In: Advances in Neural Information Processing Systems, pp. 2199–2207 (2010)

  46. Strohmer, T., Vershynin, R.: A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl. 15(2), 262–278 (2009). doi:10.1007/s00041-008-9030-4. ISSN:1069–5869

  47. Tanabe, K.: Projection method for solving a singular system of linear equations and its applications. Numer. Math. 17(3), 203–214 (1971)

  48. Wang, S., Zhang, Z.: Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. J. Mach. Learn. Res. 14, 2729–2769 (2013)

  49. Whitney, T.M., Meany, R.K.: Two algorithms related to the method of steepest descent. SIAM J. Numer. Anal. 4(1), 109–118 (1967)

  50. Zhang, H., Yin, W.: Gradient methods for convex minimization: better rates under weaker conditions. CAM Report 13–17, UCLA (2013)

  51. Zhao, P., Zhang, T.: Stochastic optimization with importance sampling. arXiv:1401.2753 (2014)

  52. Zouzias, A., Freris, N.M.: Randomized extended Kaczmarz for solving least-squares. SIAM J. Matrix Anal. Appl. 34(2), 773–793 (2013)

Acknowledgments

We would like to thank the anonymous reviewers for their useful feedback, which significantly improved the manuscript. We would like to thank Chris White for pointing out a simplified proof of Corollary 2.2. DN was partially supported by a Simons Foundation Collaboration grant, NSF CAREER #1348721, and an Alfred P. Sloan Fellowship. NS was partially supported by a Google Research Award. RW was supported in part by ONR Grant N00014-12-1-0743, an AFOSR Young Investigator Program Award, and an NSF CAREER award.

Author information

Corresponding author

Correspondence to Deanna Needell.

Appendix: Proofs

Our main results utilize an elementary fact about smooth functions with Lipschitz continuous gradient, called the co-coercivity of the gradient. We state the lemma and recall its proof for completeness.

1.1 The co-coercivity Lemma

Lemma 8.1

(Co-coercivity) For a smooth function \(f\) whose gradient has Lipschitz constant \(L\),

$$\begin{aligned} ||{\nabla f(\varvec{x}) - \nabla f(\varvec{y})} ||_2^2 \le L\left\langle \varvec{x}-\varvec{y}, \nabla f(\varvec{x}) - \nabla f(\varvec{y})\right\rangle . \end{aligned}$$

Proof

Since \(\nabla f\) has Lipschitz constant \(L\), if \({\varvec{x}}_{\star }\) is the minimizer of \(f\), then (see e.g. [32], page 26)

$$\begin{aligned} f({\varvec{x}}_{\star }) \le f(\varvec{z}) - \frac{1}{2L}||{\nabla f(\varvec{z})} ||_2^2 \quad \text {for all } \varvec{z}. \end{aligned}$$

(8.1)

Now define the convex functions

$$\begin{aligned} G(\varvec{z}) = f(\varvec{z}) - \left\langle \nabla f(\varvec{x}), \varvec{z}\right\rangle , \quad \text {and}\quad H(\varvec{z}) = f(\varvec{z}) - \left\langle \nabla f(\varvec{y}), \varvec{z}\right\rangle , \end{aligned}$$

and observe that both have \(L\)-Lipschitz gradients, with minimizers \(\varvec{x}\) and \(\varvec{y}\), respectively. Applying (8.1) to these functions (at the points \(\varvec{y}\) and \(\varvec{x}\), respectively) therefore gives

$$\begin{aligned} G(\varvec{x}) \le G(\varvec{y}) - \frac{1}{2L}||{\nabla G(\varvec{y})} ||_2^2, \quad \text {and} \quad H(\varvec{y}) \le H(\varvec{x}) - \frac{1}{2L}||{\nabla H(\varvec{x})} ||_2^2. \end{aligned}$$

By the definitions of \(G\) and \(H\), these two bounds read

$$\begin{aligned} f(\varvec{x}) - \left\langle \nabla f(\varvec{x}), \varvec{x}\right\rangle&\le f(\varvec{y}) - \left\langle \nabla f(\varvec{x}), \varvec{y}\right\rangle - \frac{1}{2L}||{\nabla f(\varvec{x}) - \nabla f(\varvec{y})} ||_2^2, \\ f(\varvec{y}) - \left\langle \nabla f(\varvec{y}), \varvec{y}\right\rangle&\le f(\varvec{x}) - \left\langle \nabla f(\varvec{y}), \varvec{x}\right\rangle - \frac{1}{2L}||{\nabla f(\varvec{x}) - \nabla f(\varvec{y})} ||_2^2. \end{aligned}$$

Adding these two inequalities and canceling terms yields the desired result. \(\square \)
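As a quick numerical sanity check of Lemma 8.1 (not part of the paper), the sketch below verifies the co-coercivity inequality at random pairs of points for a least-squares objective \(f(\varvec{x}) = \tfrac{1}{2}\Vert A\varvec{x}-\varvec{b}\Vert _2^2\), whose gradient is Lipschitz with constant \(L = \Vert A^{\top } A\Vert _2\); the problem data and tolerance are illustrative.

```python
import numpy as np

# Check Lemma 8.1 for f(x) = 0.5 * ||A x - b||^2, where grad f(x) = A^T (A x - b)
# is Lipschitz with constant L = ||A^T A||_2 (the spectral norm).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
b = rng.standard_normal(50)
L = np.linalg.norm(A.T @ A, 2)

grad = lambda x: A.T @ (A @ x - b)

for _ in range(1000):
    x, y = rng.standard_normal(10), rng.standard_normal(10)
    lhs = np.linalg.norm(grad(x) - grad(y)) ** 2
    rhs = L * (x - y) @ (grad(x) - grad(y))
    assert lhs <= rhs + 1e-9          # co-coercivity, up to round-off
print("co-coercivity verified on 1000 random pairs")
```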

1.2 Proof of Theorem 2.1

With the notation of Theorem 2.1, and where \(i\) is the random index chosen at iteration \(k\), and \(w=w_{\lambda }\), we have

$$\begin{aligned} ||{{\varvec{x}}_{k+1}-{\varvec{x}}_{\star }} ||_2^2&= ||{{\varvec{x}}_k- \gamma \nabla f_i({\varvec{x}}_k) - {\varvec{x}}_{\star }} ||_2^2 \\&= ||{{\varvec{x}}_k- {\varvec{x}}_{\star }} ||_2^2 - 2\gamma \left\langle {\varvec{x}}_k-{\varvec{x}}_{\star }, \nabla f_i({\varvec{x}}_k)\right\rangle + \gamma ^2 ||{\nabla f_i({\varvec{x}}_k)} ||_2^2 \\&\le ||{{\varvec{x}}_k- {\varvec{x}}_{\star }} ||_2^2 - 2\gamma \left\langle {\varvec{x}}_k-{\varvec{x}}_{\star }, \nabla f_i({\varvec{x}}_k)\right\rangle \\&\quad +\, 2\gamma ^2 ||{\nabla f_i({\varvec{x}}_k) - \nabla f_i({\varvec{x}}_{\star })} ||_2^2 + 2\gamma ^2 ||{\nabla f_i({\varvec{x}}_{\star })} ||_2^2 \\&\le ||{{\varvec{x}}_k- {\varvec{x}}_{\star }} ||_2^2 - 2\gamma \left\langle {\varvec{x}}_k-{\varvec{x}}_{\star }, \nabla f_i({\varvec{x}}_k)\right\rangle \\&\quad +\, 2\gamma ^2 L_i \left\langle {\varvec{x}}_k-{\varvec{x}}_{\star }, \nabla f_i({\varvec{x}}_k) - \nabla f_i({\varvec{x}}_{\star })\right\rangle + 2\gamma ^2 ||{\nabla f_i({\varvec{x}}_{\star })} ||_2^2, \end{aligned}$$

where we have employed the bound \(||{\varvec{a}+\varvec{b}} ||_2^2 \le 2||{\varvec{a}} ||_2^2 + 2||{\varvec{b}} ||_2^2\) (a consequence of Jensen's inequality) in the first inequality and the co-coercivity Lemma 8.1 in the final line. We next take an expectation with respect to the choice of \(i\). By assumption, \(i\sim {\mathcal {D}}\) is such that \(F(\varvec{x}) = \mathbb {E} f_i(\varvec{x})\) and \(\sigma ^2 = {\mathbb {E}}\Vert \nabla f_i({\varvec{x}}_{\star })\Vert ^2\). Then \({\mathbb {E}}\nabla f_i(\varvec{x})=\nabla {\mathbf {F}}(\varvec{x})\), and we obtain:

$$\begin{aligned} {\mathbb {E}}||{{\varvec{x}}_{k+1}-{\varvec{x}}_{\star }} ||_2^2&\le ||{{\varvec{x}}_k- {\varvec{x}}_{\star }} ||_2^2 - 2\gamma \left\langle {\varvec{x}}_k-{\varvec{x}}_{\star }, \nabla {\mathbf {F}}({\varvec{x}}_k)\right\rangle \\&+ 2\gamma ^2 {\mathbb {E}}\left[ L_i \left\langle {\varvec{x}}_k-{\varvec{x}}_{\star }, \nabla f_i({\varvec{x}}_k) - \nabla f_i({\varvec{x}}_{\star })\right\rangle \right] + 2\gamma ^2 {\mathbb {E}}||{\nabla f_i({\varvec{x}}_{\star })} ||_2^2 \\&\le ||{{\varvec{x}}_k- {\varvec{x}}_{\star }} ||_2^2 - 2\gamma \left\langle {\varvec{x}}_k-{\varvec{x}}_{\star }, \nabla {\mathbf {F}}({\varvec{x}}_k)\right\rangle \\&\quad +\, 2\gamma ^2 \sup _i L_i {\mathbb {E}}\left\langle {\varvec{x}}_k-{\varvec{x}}_{\star }, \nabla f_i({\varvec{x}}_k) - \nabla f_i({\varvec{x}}_{\star })\right\rangle + 2\gamma ^2 {\mathbb {E}}||{\nabla f_i({\varvec{x}}_{\star })} ||_2^2 \\&= ||{{\varvec{x}}_k- {\varvec{x}}_{\star }} ||_2^2 - 2\gamma \left\langle {\varvec{x}}_k-{\varvec{x}}_{\star }, \nabla {\mathbf {F}}({\varvec{x}}_k)\right\rangle \\&\quad +\, 2\gamma ^2 \sup L \left\langle {\varvec{x}}_k-{\varvec{x}}_{\star }, \nabla {\mathbf {F}}({\varvec{x}}_k) - \nabla {\mathbf {F}}({\varvec{x}}_{\star })\right\rangle + 2\gamma ^2 \sigma ^2 \end{aligned}$$

We now utilize the strong convexity of \(F(\varvec{x})\), which gives \(\left\langle {\varvec{x}}_k-{\varvec{x}}_{\star }, \nabla {\mathbf {F}}({\varvec{x}}_k)\right\rangle \ge \mu ||{{\varvec{x}}_k-{\varvec{x}}_{\star }} ||_2^2\), together with the fact that \(\nabla {\mathbf {F}}({\varvec{x}}_{\star })=0\), and obtain that

$$\begin{aligned} {\mathbb {E}}||{{\varvec{x}}_{k+1}-{\varvec{x}}_{\star }} ||_2^2&\le ||{{\varvec{x}}_k- {\varvec{x}}_{\star }} ||_2^2 - 2\gamma \mu (1 - \gamma \sup L) ||{{\varvec{x}}_k-{\varvec{x}}_{\star }} ||_2^2 + 2\gamma ^2 \sigma ^2 \\&= (1 - 2\gamma \mu (1 - \gamma \sup L)) ||{{\varvec{x}}_k-{\varvec{x}}_{\star }} ||_2^2 + 2\gamma ^2 \sigma ^2 \end{aligned}$$

when \(\gamma \le \frac{1}{ \sup L}\). Recursively applying this bound over the first \(k\) iterations yields the desired result,

$$\begin{aligned} {\mathbb {E}}||{{\varvec{x}}_k-{\varvec{x}}_{\star }} ||_2^2&\le \Big (1 - 2\gamma \mu (1 - \gamma \sup L)\Big )^k||{{\varvec{x}}_0- {\varvec{x}}_{\star }} ||_2^2\\&\quad +\, 2\sum _{j=0}^{k-1} \Big (1 - 2\gamma \mu (1 - \gamma \sup L)\Big )^j \gamma ^2\sigma ^2 \\&\le \Big (1 - 2\gamma \mu (1 - \gamma \sup L)\Big )^k ||{{\varvec{x}}_0- {\varvec{x}}_{\star }} ||_2^2 + \frac{\gamma \sigma ^2}{\mu \big ( 1 - \gamma \sup L \big )}. \end{aligned}$$
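To illustrate this bound empirically (our sketch, not the paper's), the code below runs SGD with uniform sampling on a least-squares objective \(F(\varvec{x}) = \mathbb {E}_i f_i(\varvec{x})\) with \(f_i(\varvec{x}) = \tfrac{1}{2}(\left\langle \varvec{a}_i, \varvec{x}\right\rangle - b_i)^2\), for which \(L_i = \Vert \varvec{a}_i\Vert _2^2\), \(\mu = \lambda _{\min }(A^{\top }A)/m\), and \(\sigma ^2 = \mathbb {E}\Vert \nabla f_i({\varvec{x}}_{\star })\Vert _2^2\). Since the theorem bounds the expected squared error, a single run only typically falls below the computed value; problem sizes, step size, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 500, 10
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)   # inconsistent system

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)       # minimizer of F
L_i = np.einsum('ij,ij->i', A, A)                    # L_i = ||a_i||^2
sup_L = L_i.max()
mu = np.linalg.eigvalsh(A.T @ A).min() / m           # strong convexity of F
residual = A @ x_star - b
sigma2 = np.mean(L_i * residual ** 2)                # E ||grad f_i(x_*)||^2
gamma = 0.5 / sup_L                                  # step size, gamma <= 1 / sup L

x = np.zeros(n)
err0 = np.linalg.norm(x - x_star) ** 2
k = 20000
for _ in range(k):
    i = rng.integers(m)
    x -= gamma * (A[i] @ x - b[i]) * A[i]            # SGD step with grad f_i(x)

bound = (1 - 2 * gamma * mu * (1 - gamma * sup_L)) ** k * err0 \
        + gamma * sigma2 / (mu * (1 - gamma * sup_L))
print(np.linalg.norm(x - x_star) ** 2, "vs. bound", bound)
```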

Cite this article

Needell, D., Srebro, N. & Ward, R. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Math. Program. 155, 549–573 (2016). https://doi.org/10.1007/s10107-015-0864-7
