
Sampling Kaczmarz-Motzkin method for linear feasibility problems: generalization and acceleration

Full Length Paper, Series A. Mathematical Programming.

Abstract

Randomized Kaczmarz, Motzkin Method and Sampling Kaczmarz Motzkin (SKM) algorithms are commonly used iterative techniques for solving a system of linear inequalities (i.e., \(Ax \le b\)). As linear systems of equations represent a modeling paradigm for solving many optimization problems, these randomized and iterative techniques are gaining popularity among researchers in different domains. In this work, we propose a Generalized Sampling Kaczmarz Motzkin (GSKM) method that unifies the iterative methods into a single framework. In addition to the general framework, we propose a Nesterov-type acceleration scheme in the SKM method called Probably Accelerated Sampling Kaczmarz Motzkin (PASKM). We prove the convergence theorems for both GSKM and PASKM algorithms in the \(L_2\) norm perspective with respect to the proposed sampling distribution. Furthermore, we prove sub-linear convergence for the Cesaro average of iterates for the proposed GSKM and PASKM algorithms. From the convergence theorem of the GSKM algorithm, we find the convergence results of several well-known algorithms like the Kaczmarz method, Motzkin method and SKM algorithm. We perform thorough numerical experiments using both randomly generated and real-world (classification with support vector machine and Netlib LP) test instances to demonstrate the efficiency of the proposed methods. We compare the proposed algorithms with SKM, Interior Point Method and Active Set Method in terms of computation time and solution quality. In the majority of the problem instances, the proposed generalized and accelerated algorithms significantly outperform the state-of-the-art methods.
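
To make the proposed framework concrete, the following is a minimal NumPy sketch of the GSKM iteration: at every step a set of \(\beta \) constraints is sampled uniformly, the most violated sampled constraint \(i^*\) is selected, a relaxed projection with step size \(\delta \) produces \(z_k\), and the next iterate mixes the last two projection points through the parameter \(\xi \), so that \(\xi = 0\) recovers SKM, \(\beta = 1\) a randomized Kaczmarz-type step and \(\beta = m\) the Motzkin (most violated constraint) rule. This is an illustrative sketch rather than the authors' implementation; it assumes the rows of A are normalized to unit Euclidean norm, and the function name, default parameters and random test instance are ours.

```python
import numpy as np

def gskm(A, b, delta=1.0, xi=0.0, beta=50, iters=2000, tol=1e-5, seed=0):
    """Illustrative GSKM sketch for the feasibility problem A x <= b.

    Per iteration:
        z_k     = x_k - delta * (a_{i*}^T x_k - b_{i*})^+ * a_{i*}
        x_{k+1} = (1 - xi) * z_k + xi * z_{k-1}
    where i* is the most violated constraint in a uniform sample of size beta.
    Rows of A are assumed to have unit Euclidean norm.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    z_prev = x.copy()                                         # stands in for z_{k-1}
    for _ in range(iters):
        tau = rng.choice(m, size=beta, replace=False)         # sampled constraint set
        res = np.maximum(A[tau] @ x - b[tau], 0.0)            # positive residuals on the sample
        i_star = tau[np.argmax(res)]                          # most violated sampled constraint
        z = x - delta * max(A[i_star] @ x - b[i_star], 0.0) * A[i_star]
        x = (1.0 - xi) * z + xi * z_prev
        z_prev = z
        if np.linalg.norm(np.maximum(A @ x - b, 0.0)) < tol:  # residual-based stopping rule
            break
    return x

if __name__ == "__main__":
    # Random feasible instance with unit-norm rows (illustrative only).
    rng = np.random.default_rng(1)
    m, n = 500, 50
    A = rng.standard_normal((m, n))
    A /= np.linalg.norm(A, axis=1, keepdims=True)
    b = A @ rng.standard_normal(n) + 0.1                      # feasible with margin 0.1
    x = gskm(A, b, delta=1.0, xi=0.2, beta=50)
    print("max violation:", np.max(np.maximum(A @ x - b, 0.0)))
```

The \(10^{-5}\) residual threshold in the stopping rule mirrors the threshold used in the numerical experiments (see Note 10 below).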



Notes

  1. We use the notation \((Ax-b)^+_{\underline{\mathbf {i_j}}}\) throughout the paper to express the underlying expectation, where the indices \(\underline{\mathbf {i_j}}\) represent the sampling process of (5).

  2. For ease of notation, throughout the paper, we will use \({\mathbb {S}}_k\) to denote the sampling distribution corresponding to any random iterate \(x_k \in \mathbb {R}^n\).

  3. Similarly, we will use \(\tau _k \sim {\mathbb {S}}_k\) to denote the sampled set and \(i^* = {{\,\mathrm{arg\,max}\,}}_{i \in \tau _k \sim {\mathbb {S}}_k} \{a_i^Tx_k-b_i, 0\} \ = \ {{\,\mathrm{arg\,max}\,}}_{i \in \tau _k \sim {\mathbb {S}}_k} (A_{\tau _k}x_k-b_{\tau _k})^+_i\) for any iterate \(x_k \in \mathbb {R}^n\).

  4. Similar types of functions with uniform sampling have been studied in [10, 12] and [39, 40, 58] in the context of stochastic gradient descent, alternating projection algorithms and GR sketching methods, respectively.

  5. A similar type of approach can be found in the literature for solving linear systems [39, 62].

  6. Recently, the heavy ball momentum method has been proposed in the context of the SKM method [63] and sketching methods [64] for solving linear systems [65] and LF problems.

  7. Several works exist on Kaczmarz-type methods for solving linear systems [39, 58].

  8. When \(\delta = 2\), we have \(h(\delta ) = 1-\eta \mu _1 = 1\), since \(\eta = 2\delta - \delta ^2 = 0\). In that case, the condition of (29) simplifies to \(\omega < \frac{\mu _1(2-\gamma )}{3\mu _1+1}\). In other words, for \(\delta =2\) the PASKM algorithm will converge if we select the parameters as \(0 \le \gamma < 2\), \(0 \le \alpha \le 1\), \(\omega < \frac{\mu _1(2-\gamma )}{3\mu _1+1}\) and \( \mu _1 = \lambda _{\min }^+(A^TA)/m\); an illustrative parameter-selection sketch with this choice is given after these notes.

  9. Note that for the choice \(p > \frac{1}{\mu _1}\) the condition (30) trivially holds as the right hand side of (30) is always greater than 1.

  10. Note that there are sudden jumps in the FSC graphs whenever FSC crosses approximately 0.6. This phenomenon is caused by the stopping criterion of our experiments: we halt the algorithm whenever the residual error satisfies the threshold \(10^{-5}\), which produces the sudden jumps. One can run the algorithm with a smaller threshold to mitigate this.

  11. The credit card data matrix has 30,000 rows. From our earlier experiments, we observe that for the choice \(1 < \beta \le 100\) the proposed algorithms produce the best performance. For that reason, we plot the credit card graph up to \(\beta = 2000\). The irregularity of the credit card graph occurs when \(\beta > 2000\).

  12. We note the best possible time from our previous experiments.

  13. Note that, since \(\varGamma _1 \varGamma _2 \le 0\) and \(\varGamma _1 \varGamma _3 \le 1 \le \sqrt{(1+\phi )}\), from Theorem 4 we have \({{\,\mathrm{\mathbb {E}}\,}}[d(x_{k+1},P)] \le \sqrt{(1+\phi )} \ \rho _2^k \ d(x_0,P)\) for any \(\xi \in Q_2\).
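
Complementing Note 8, here is a minimal sketch (ours, not the authors' implementation) of the PASKM update in the form it appears in the proof of Theorem 7, \(x_{k+1} = y_k - \delta (a_{i^*}^Ty_k-b_{i^*})^+ a_{i^*}\), \(v_{k+1} = \omega v_k + (1-\omega ) y_k - \gamma (a_{i^*}^Ty_k-b_{i^*})^+ a_{i^*}\), \(y_{k+1} = \alpha v_{k+1} + (1-\alpha ) x_{k+1}\), together with the \(\delta = 2\) parameter rule of Note 8: \(\omega \) is taken strictly below \(\mu _1(2-\gamma )/(3\mu _1+1)\) with \(\mu _1 = \lambda _{\min }^+(A^TA)/m\). Rows of A are assumed to have unit norm; the defaults are placeholders.

```python
import numpy as np

def paskm(A, b, beta=50, delta=2.0, gamma=1.0, alpha=0.5,
          iters=2000, tol=1e-5, seed=0):
    """Illustrative PASKM sketch (Nesterov-type accelerated SKM) for A x <= b.

    i* is the most violated constraint in a uniform sample of size beta, and
        x_{k+1} = y_k - delta * r * a_{i*}
        v_{k+1} = omega * v_k + (1 - omega) * y_k - gamma * r * a_{i*}
        y_{k+1} = alpha * v_{k+1} + (1 - alpha) * x_{k+1}
    with r = (a_{i*}^T y_k - b_{i*})^+.  omega follows the delta = 2 rule of Note 8.
    Rows of A are assumed to have unit Euclidean norm.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    eigs = np.linalg.eigvalsh(A.T @ A)
    mu1 = eigs[eigs > 1e-12].min() / m                        # lambda_min^+(A^T A) / m
    omega = 0.5 * mu1 * (2.0 - gamma) / (3.0 * mu1 + 1.0)     # strictly below the Note 8 threshold
    x = np.zeros(n)
    y = np.zeros(n)
    v = np.zeros(n)
    for _ in range(iters):
        tau = rng.choice(m, size=beta, replace=False)
        res = np.maximum(A[tau] @ y - b[tau], 0.0)
        i_star = tau[np.argmax(res)]
        r = max(A[i_star] @ y - b[i_star], 0.0)
        x = y - delta * r * A[i_star]
        v = omega * v + (1.0 - omega) * y - gamma * r * A[i_star]
        y = alpha * v + (1.0 - alpha) * x
        if np.linalg.norm(np.maximum(A @ x - b, 0.0)) < tol:
            break
    return x
```

It can be exercised on the same random feasible instance as the GSKM sketch after the abstract; the defaults here are placeholders, not tuned values.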

References

  1. Strohmer, T., Vershynin, R.: A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl. 15(2), 262 (2008)

  2. Leventhal, D., Lewis, A.S.: Randomized methods for linear constraints: convergence rates and conditioning. Math. Oper. Res. 35(3), 641–654 (2010)

  3. Needell, D.: Randomized Kaczmarz solver for noisy linear systems. BIT Numer. Math. 50(2), 395–403 (2010)

  4. Drineas, P., Mahoney, M.W., Muthukrishnan, S., Sarlós, T.: Faster least squares approximation. Numer. Math. 117(2), 219–249 (2011)

  5. Zouzias, A., Freris, N.M.: Randomized extended Kaczmarz for solving least squares. SIAM J. Matrix Anal. Appl. 34(2), 773–793 (2013)

  6. Lee, Y.T., Sidford, A.: Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In: Proceedings of the 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, FOCS ’13, pp. 147–156, Washington, DC, USA, 2013. IEEE Computer Society

  7. Ma, A., Needell, D., Ramdas, A.: Convergence properties of the randomized extended Gauss Seidel and Kaczmarz methods. SIAM J. Matrix Anal. Appl. 36(4), 1590–1604 (2015)

  8. Gower, R.M., Richtárik, P.: Randomized iterative methods for linear systems. SIAM J. Matrix Anal. Appl. 36(4), 1660–1690 (2015)

  9. Qu, Z., Richtarik, P., Takac, M., Fercoq, O.: SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization. In: Proceedings of The 33rd International Conference on Machine Learning, vol. 48, pp. 1823–1832, New York, USA, 20–22 Jun 2016. PMLR

  10. Needell, D., Srebro, N., Ward, R.: Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Math. Program. 155(1), 549–573 (2016)

  11. De Loera, J., Haddock, J., Needell, D.: A Sampling Kaczmarz-Motzkin algorithm for linear feasibility. SIAM J. Sci. Comput. 39(5), S66–S87 (2017)

  12. Razaviyayn, M., Hong, M., Reyhanian, N., Luo, Z.-Q.: A linearly convergent doubly stochastic Gauss-Seidel algorithm for solving linear equations and a certain class of over-parameterized optimization problems. Math. Program. 176(1), 465–496 (2019)

  13. Haddock, J., Ma, A.: Greed works: an improved analysis of Sampling Kaczmarz-Motzkin (2019)

  14. Kaczmarz, S.: Angenäherte Auflösung von Systemen linearer Gleichungen. Bulletin International de l’Académie Polonaise des Sciences et des Lettres 35, 355–357 (1937)

  15. Gordon, R., Bender, R., Herman, G.T.: Algebraic reconstruction techniques (art) for three-dimensional electron microscopy and x-ray photography. J. Theor. Biol. 29(3), 471–481 (1970)

  16. Censor, Y.: Parallel application of block-iterative methods in medical imaging and radiation therapy. Math. Program. 42(1), 307–325 (1988)

  17. Herman, G.T.: Fundamentals of Computerized Tomography: Image Reconstruction from Projections, 2nd edn. Springer, New York (2009)

  18. Lorenz, D.A., Wenger, S., Schöpfer, F., Magnor, M.: A sparse Kaczmarz solver and a linearized bregman method for online compressed sensing. In: 2014 IEEE International Conference on Image Processing (ICIP), pp. 1347–1351 (2014)

  19. Elble, J.M., Sahinidis, N.V., Vouzis, P.: Gpu computing with Kaczmarz’s and other iterative algorithms for linear systems. Parallel Comput. 36(5), 215–231 (2010)

  20. Pasqualetti, F., Carli, R., Bullo, F.: Distributed estimation via iterative projections with application to power network monitoring. Automatica 48(5), 747–758 (2012)

  21. Censor, Y.: Row-action methods for huge and sparse systems and their applications. SIAM Rev. 23(4), 444–466 (1981)

  22. Agmon, S.: The relaxation method for linear inequalities. Can. J. Math. 6, 382–392 (1954)

  23. Motzkin, T.S., Schoenberg, I.J.: The relaxation method for linear inequalities. Can. J. Math. 6, 393–404 (1954)

  24. Rosenblatt, F.: The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev. 65(6), 386–408 (1958)

  25. Ramdas, A., Peña, J.: Margins, kernels and non-linear smoothed perceptrons. In: Proceedings of the 31st International Conference on Machine Learning, vol. 32, pp. 244–252, Bejing, China, 22–24 Jun 2014. PMLR

  26. Ramdas, A., Peña, J.: Towards a deeper geometric, analytic and algorithmic understanding of margins. Optim. Methods Softw. 31(2), 377–391 (2016)

  27. Nutini, J., Sepehry, B., Laradji, I., Schmidt, M., Koepke, H., Virani, A.: Convergence rates for greedy Kaczmarz algorithms, and faster randomized Kaczmarz rules using the orthogonality graph. In: Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI’16, pp. 547–556, Arlington, Virginia, United States. AUAI Press (2016)

  28. Petra, S., Popa, C.: Single projection Kaczmarz extended algorithms. Numer. Algorithms 73(3), 791–806 (2016)

  29. Telgen, J.: On relaxation methods for systems of linear inequalities. Eur. J. Oper. Res. 9(2), 184–189 (1982)

  30. Maurras, J.F., Truemper, K., Akgül, M.: Polynomial algorithms for a class of linear programs. Math. Program. 21(1), 121–136 (1981)

  31. Chubanov, S.: A strongly polynomial algorithm for linear systems having a binary solution. Math. Program. 134(2), 533–570 (2012)

  32. Chubanov, S.: A polynomial projection algorithm for linear feasibility problems. Math. Program. 153(2), 687–713 (2015)

  33. Liu, J., Wright, S.J.: An accelerated randomized Kaczmarz algorithm. Math. Comput. 85(297), 153–178 (2016)

  34. Needell, D., Zhao, R., Zouzias, A.: Randomized block Kaczmarz method with projection for solving least squares. Linear Algebra Appl. 484, 322–343 (2015)

  35. Ma, A., Needell, D., Ramdas, A.: Convergence properties of the randomized extended Gauss-Seidel and Kaczmarz methods. SIAM J. Matrix Anal. Appl. 36(4), 1590–1604 (2015)

  36. Hefny, A., Needell, D., Ramdas, A.: Rows versus columns: randomized Kaczmarz or Gauss-Seidel for ridge regression. SIAM J. Sci. Comput. 39(5), S528–S542 (2017)

  37. Gower, R.M., Richtárik, P.: Linearly convergent randomized iterative methods for computing the pseudoinverse (2016)

  38. Gower, R.M., Richtárik, P.: Randomized quasi-Newton updates are linearly convergent matrix inversion algorithms. SIAM J. Matrix Anal. Appl. 38(4), 1380–1409 (2017)

  39. Richtárik, P., Takáč, M.: Stochastic reformulations of linear systems: algorithms and convergence theory (2017)

  40. Gower, R., Hanzely, F., Richtarik, P., Stich, S.U.: Accelerated stochastic matrix inversion: general theory and speeding up BFGS rules for faster second-order optimization. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.), Advances in Neural Information Processing Systems, vol. 31, pp. 1619–1629. Curran Associates, Inc. (2018)

  41. Needell, D., Tropp, J.A.: Paved with good intentions: analysis of a randomized block Kaczmarz method. Linear Algebra Appl. 441, 199–221 (2014)

  42. Briskman, J., Needell, D.: Block Kaczmarz method with inequalities. J. Math. Imaging Vis. 52(3), 385–396 (2015)

  43. Needell, D., Rebrova, E.: On block Gaussian sketching for the Kaczmarz method (2019)

  44. Basu, A., De Loera, J.A., Junod, M.: On Chubanov’s method for linear programming. INFORMS J. Comput. 26(2), 336–350 (2014)

  45. Végh, A.L., Zambelli, G.: A polynomial projection-type algorithm for linear programming. Oper. Res. Lett. 42(1), 91–96 (2014)

  46. Eldar, Y.C., Needell, D.: Acceleration of randomized Kaczmarz method via the Johnson-Lindenstrauss lemma. Numer. Algorithms 58(2), 163–177 (2011)

  47. Agaskar, A., Wang, C., Lu, Y.M.: Randomized Kaczmarz algorithms: exact MSE analysis and optimal sampling probabilities. In: 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pp. 389–393 (2014)

  48. Bai, Z.-Z., Wen-Ting, W.: On relaxed greedy randomized Kaczmarz methods for solving large sparse linear systems. Appl. Math. Lett. 83, 21–26 (2018)

  49. Bai, Z.-Z., Wu, W.-T.: On greedy randomized Kaczmarz method for solving large sparse linear systems. SIAM J. Sci. Comput. 40(1), A592–A606 (2018)

  50. Kovachki, N.B., Stuart, A.M.: Analysis of momentum methods. arXiv preprint arXiv:1906.04285 (2019)

  51. Ruder, S.: An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747 (2016)

  52. Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)

  53. Nesterov, Y.: A method for solving the convex programming problem with convergence rate \(o(1/k^{2})\). Sov. Math. Dokl. 27, 372–376 (1983)

  54. Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)

  55. Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)

  56. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, 1st edn. Springer, New York (2014)

  57. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)

  58. Loizou, N., Richtárik, P.: Momentum and stochastic momentum for stochastic gradient, newton, proximal point and subspace descent methods (2017)

  59. Morshed, M.S., Noor-E-Alam, M.: Generalized affine scaling algorithms for linear programming problems. Comput. Oper. Res. 114, 104807 (2020)

  60. Loizou, M.R.N., Richtárik, P.: Provably accelerated randomized gossip algorithms (2018)

  61. Morshed, Md.S., Islam, Md.S., Noor-E-Alam, Md.: Accelerated sampling Kaczmarz Motzkin algorithm for the linear feasibility problem. J. Glob. Optim. (2019)

  62. Censor, Y., Gordon, D., Gordon, R.: Component averaging: an efficient iterative parallel algorithm for large and sparse unstructured problems. Parallel Comput. 27(6), 777–808 (2001)

  63. Morshed, Md.S., Noor-E-Alam, Md.: Heavy Ball Momentum Induced Sampling Kaczmarz Motzkin Methods for Linear Feasibility Problems. arXiv preprint arXiv:2009.08251 (2020)

  64. Morshed, Md.S., Noor-E-Alam, Md.: Sketch & Project Methods for Linear Feasibility Problems: Greedy Sampling & Momentum. arXiv preprint arXiv:2012.02913 (2020)

  65. Morshed, Md.S., Ahmad, S., Noor-E-Alam, Md.: Stochastic Steepest Descent Methods for Linear Systems: Greedy Sampling & Momentum. arXiv preprint arXiv:2012.13087 (2020)

  66. Hoffman, A.J.: On approximate solutions of systems of linear inequalities. In: Selected Papers of Alan J Hoffman: With Commentary, pp. 174–176. World Scientific (2003)

  67. Karimi, H., Nutini, J., Schmidt, M.: Linear convergence of gradient and proximal-gradient methods under the Polyak-łojasiewicz condition. In: Frasconi, P., Landwehr, N., Manco, G., Vreeken, J. (eds.) Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Springer, Cham (2016)

  68. Khachiyan, L.G.: Polynomial algorithms in linear programming. USSR Comput. Math. Math. Phys. 20(1), 53–72 (1980)

  69. Yeh, I.-C., Lien, C.: The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Syst. Appl. 36(2), 2473–2480 (2009)

  70. Lichman, M.: UCI machine learning repository (2013)

  71. Netlib. The netlib linear programming library

  72. Calafiore, G., Ghaoui, L.E.: Optimization Models. Control systems and optimization series. Cambridge University Press (2014)


Acknowledgements

The authors are truly grateful to the anonymous referees and the editors for their valuable comments and suggestions on the earlier version of the paper. The comments helped immensely in the revision process and greatly improved the quality of this paper.

Author information

Corresponding author

Correspondence to Md. Noor-E-Alam.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1

Proof of Lemma 5

Using the expectation expression given in the first section we have

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}} \left[ a_{i^*}a_{i^*}^T\right]&= \frac{1}{\left( {\begin{array}{c}m\\ \beta \end{array}}\right) } \sum \limits _{j = 0}^{m-\beta } \left( {\begin{array}{c}\beta -1+j\\ \beta -1\end{array}}\right) (A^TA)_{\underline{\mathbf {i_j}}} \\&\preceq \ \frac{\left( {\begin{array}{c}m-1\\ \beta -1\end{array}}\right) }{\left( {\begin{array}{c}m\\ \beta \end{array}}\right) } \sum \limits _{j = 0}^{m-\beta } (A^TA)_{\underline{\mathbf {i_j}}} \ \preceq \ \frac{\beta }{m} \sum \limits _{i = 1}^{m} a_{i} a_{i}^T \ = \ \frac{\beta }{m} A^TA. \end{aligned}$$

Here, the notation \((A^TA)_{\underline{\mathbf {i_j}}}\) denotes the matrix \(a_{i_j} a_{i_j}^T\), where the index \(i_j\) belongs to the list (5) and corresponds to the \((\beta + j)^{th}\) entry of that list. This proves the Lemma. \(\square \)
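
As an illustrative sanity check (not part of the paper), the following snippet estimates \({{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}} [a_{i^*}a_{i^*}^T]\) by Monte Carlo for a random instance with unit-norm rows, where \(i^*\) is the row with the largest positive residual in a uniform sample of size \(\beta \), and verifies the bound \({{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}} [a_{i^*}a_{i^*}^T] \preceq \frac{\beta }{m} A^TA\) in the positive semidefinite order. The instance, the point x (chosen so that every residual is strictly positive) and the sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, beta, trials = 200, 20, 20, 20000

# Random matrix with unit-norm rows; b chosen so that every residual A x - b is positive.
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)
x = rng.standard_normal(n)
b = A @ x - rng.uniform(0.5, 1.5, size=m)
res = A @ x - b                                   # all entries strictly positive

# Monte Carlo estimate of E_S[a_{i*} a_{i*}^T]: sample beta rows uniformly and
# keep the row with the largest residual in the sample.
E = np.zeros((n, n))
for _ in range(trials):
    tau = rng.choice(m, size=beta, replace=False)
    i_star = tau[np.argmax(res[tau])]
    E += np.outer(A[i_star], A[i_star])
E /= trials

# Lemma 5: E_S[a_{i*} a_{i*}^T] <= (beta/m) A^T A in the PSD order, so the gap below
# should have no significantly negative eigenvalue (up to Monte Carlo error).
gap = (beta / m) * (A.T @ A) - E
print("smallest eigenvalue of the gap:", np.linalg.eigvalsh(gap).min())
```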

Proof of Lemma 6

Using the definition of the expectation from (7), we have,

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}} \left[ \big | (a_{i^*}^Tx-b_{i^*})^+\big |^2\right]&\overset{(7)}{=} \frac{1}{\left( {\begin{array}{c}m\\ \beta \end{array}}\right) } \sum \limits _{j = 0}^{m-\beta } \left( {\begin{array}{c}\beta -1+j\\ \beta -1\end{array}}\right) | (Ax-b)^{+}_{\underline{\mathbf {i_j}}} |^2 \\&\overset{\text {Lemma} \ 2}{\ge } \frac{1}{\left( {\begin{array}{c}m\\ \beta \end{array}}\right) } \sum \limits _{j = 0}^{m-\beta } \frac{\sum \limits _{l = 0}^{m-\beta } \left( {\begin{array}{c}\beta -1+l\\ \beta -1\end{array}}\right) }{m-\beta +1} | (Ax-b)^{+}_{\underline{\mathbf {i_j}}} |^2 \\&\ge \frac{1}{m-\beta +1} \sum \limits _{j = 0}^{m-\beta } \big |(Ax - b)^+_{\underline{\mathbf {i_j}}} \big |^2 \\&\ge \frac{1}{m-\beta +1} \min \{\frac{m-\beta +1}{m-s}, 1\} \ \Vert (Ax - b)^+\Vert ^2 \\&\overset{\text {Lemma} \ 1}{\ge } \ \frac{1}{m L^2} \ d(x,P)^2. \end{aligned}$$

Here, s is the number of zero entries in the residual \((Ax - b)^+\), which also corresponds to the number of satisfied constraints for x. Since \(A {\mathcal {P}} (x) \le b\), we have the following:

$$ \begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}} \left[ \big | (a_{i^*}^Tx-b_{i^*})^+\big |^2\right]&\overset{(7)}{=} \frac{1}{\left( {\begin{array}{c}m\\ \beta \end{array}}\right) } \sum \limits _{j = 0}^{m-\beta } \left( {\begin{array}{c}\beta -1+j\\ \beta -1\end{array}}\right) | (Ax-b)^{+}_{\underline{\mathbf {i_j}}}|^2 \\&\le \frac{1}{\left( {\begin{array}{c}m\\ \beta \end{array}}\right) } \sum \limits _{j = 0}^{m-\beta } \left( {\begin{array}{c}\beta -1+j\\ \beta -1\end{array}}\right) \ \big | (Ax-A{\mathcal {P}} (x))_{\underline{\mathbf {i_j}}} \big |^2 \\&= \frac{1}{\left( {\begin{array}{c}m\\ \beta \end{array}}\right) } (x-{\mathcal {P}} (x))^T\sum \limits _{j = 0}^{m-\beta } \left( {\begin{array}{c}\beta -1+j\\ \beta -1\end{array}}\right) \ (A^TA)_{\underline{\mathbf {i_j}}} (x-{\mathcal {P}} (x) ) \\&= (x-{\mathcal {P}} (x) )^T {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}} \left[ a_{i^*}a_{i^*}^T\right] (x-{\mathcal {P}} (x) ) \\&\overset{\text {Lemma} \ 4 \ \& \ 5 }{\le } \ \min \left\{ 1, \frac{\beta }{m} \lambda _{\max }\right\} \big \Vert x- {\mathcal {P}} (x) \big \Vert ^2 \\&= \ \min \left\{ 1, \frac{\beta }{m} \lambda _{\max }\right\} d(x,P)^2. \end{aligned}$$

Combining the above identities and using the expression for f(x) from (9) we get

$$\begin{aligned} \frac{\mu _1}{ 2} \ d(x,P)^2 \ \le \ f(x) \ \le \ \frac{\mu _2}{2} \ d(x,P)^2, \end{aligned}$$

which proves the Lemma. \(\square \)

Proof of Lemma 7

Considering the definitions of f(x) and g(y), we have

$$\begin{aligned} \langle x-y, g(y) \rangle&= \big \langle x-y, {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}} \left[ (a_{i^*}^Ty-b_{i^*})^{+} a_{i^*}\right] \big \rangle = \sum \limits _{i: a_i^Ty-b_i> 0} p_i \big \langle x-y, (a_{i}^Ty-b_{i}) a_{i} \big \rangle \\&\le \ \sum \limits _{i: a_i^Ty-b_i> 0} p_i (a_{i}^Ty-b_{i}) (a_{i}^Tx-b_{i})- \sum \limits _{i: a_i^Ty-b_i> 0} p_i (a_{i}^Ty-b_{i})^2 \\&\le \ \sum \limits _{i: a_i^Ty-b_i> 0} p_i (a_{i}^Ty-b_{i}) (a_{i}^Tx-b_{i})^+- \sum \limits _{i: a_i^Ty-b_i> 0} p_i (a_{i}^Ty-b_{i})^2 \\&\le \ \sum \limits _{i: a_i^Ty-b_i> 0} p_i \frac{\big | (a_{i}^Tx-b_{i})^+ \big |^2}{2}- \sum \limits _{i: a_i^Ty-b_i > 0} p_i \frac{(a_{i}^Ty-b_{i})^2}{2} \\&\le \ \sum \limits _{i} p_i \frac{\big | (a_{i}^Tx-b_{i})^+ \big |^2}{2}- \sum \limits _{i} p_i \frac{\big | (a_{i}^Ty-b_{i})^+ \big |^2}{2} \\&\le f(x) - f(y) \overset{\text {Lemma} \ 6 }{\le } \ \frac{\mu _2}{2} \ d(x,P)^2 - \frac{\mu _1}{2} \ d(y,P)^2. \end{aligned}$$

where \(0 \le p_i \le 1\) denote the corresponding probabilities for all i. Here, in the third inequality we used the relation \(2ab \le a^2 + b^2\). This completes the proof. \(\square \)

Proof of Lemma 8

Since, \({\bar{y}} \in P\), from the definition we have,

$$\begin{aligned} \langle {\bar{y}}-y, {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}} \left[ a_{i^*}(a_{i^*}^Ty-b_{i^*})^{+}\right] \rangle \&= {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}} \left[ (a_{i^*}^Ty-b_{i^*})^{+} \left( a_{i^*}^T{\bar{y}}- a_{i^*}^T y\right) \right] \\&\le {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}} \left[ (a_{i^*}^Ty-b_{i^*})^{+} \left( b_{i^*}- a_{i^*}^T y\right) \right] \\&= - {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}} \left[ \big | (a_{i^*}^Ty-b_{i^*})^{+} \big |^2\right] \\&= - 2f(y) \overset{\text {Lemma} \ 6 }{\le } \ \ -\mu _1 \ d(y,P)^2. \end{aligned}$$

Here, we used the identity \(xx^+ = |x^+|^2\). This proves the Lemma. \(\square \)

Proof of Lemma 9

Since, \({\mathcal {P}}(x) \in P\), we have

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}} \left[ d(z,P)^2\right]&\overset{\text {Lemma} \ 3}{\le } \ {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}} \left[ \Vert z- {\mathcal {P}}(x)\Vert ^2\right] = {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}} \left[ \big \Vert x- {\mathcal {P}} (x) - \delta \left( a_{i^*}^Tx-b_{i^*}\right) ^+ a_{i^*} \big \Vert ^2 \right] \\&\overset{(9)}{=} \ \Vert x - {\mathcal {P}} (x) \Vert ^2 + 2\delta ^2 f(x) + 2 \delta \ \big \langle {\mathcal {P}}(x)-x, g(x) \big \rangle \\&\overset{\text {Lemma} \ 8 }{\le } \ \Vert x - {\mathcal {P}} (x) \Vert ^2 -2(2\delta -\delta ^2) f(x) \\&\overset{\text {Lemma} \ 6 }{\le } \ \Vert x- {\mathcal {P}} (x) \Vert ^2 - (2\delta -\delta ^2) \ \mu _1 \Vert x- {\mathcal {P}} (x) \Vert ^2 = h(\delta ) \ d(x,P)^2. \end{aligned}$$

Here, we used the lower bound of the expected value from Lemma 6. \(\square \)

Proof of Theorem 1

Since, \(\phi _1, \phi _2 \ge 0\), the largest root \(\phi \) of equation \(\phi ^2 + \phi _1 \phi - \phi _2 = 0\) can be written as

$$\begin{aligned} \phi = \frac{-\phi _1+ \sqrt{\phi _1^2+4\phi _2}}{2} \ge \frac{-\phi _1+ \phi _1}{2} = 0. \end{aligned}$$

Then, using the given recurrence, we have,

$$\begin{aligned} G_{k+1} + \phi G_k&\le (\phi +\phi _1) G_k + \phi _2 G_{k-1} \\&= (\phi +\phi _1) \left( G_{k} + \phi G_{k-1}\right) \\&\vdots \\&\le (\phi +\phi _1)^k \left( G_{1} + \phi G_{0}\right) \\&= (\phi +\phi _1)^k (1+\phi ) G_0. \end{aligned}$$

This proves the first part. Also note that since \(\phi _1 + \phi _2 < 1\), we have,

$$\begin{aligned} \phi + \phi _1&= \frac{\phi _1+ \sqrt{\phi _1^2+4\phi _2}}{2} < \frac{\phi _1+ \sqrt{\phi _1^2+4(1-\phi _1)}}{2} = \frac{\phi _1+2-\phi _1}{2} = 1. \end{aligned}$$

For the second part, notice that from the recurrence inequality, we can deduce the following matrix inequality:

$$\begin{aligned} \begin{bmatrix} G_{k+1} \\ G_{k} \end{bmatrix} \le \begin{bmatrix} \phi _1^2+\phi _2 &{} \phi _1 \phi _2 \\ \phi _1 &{} \phi _2 \end{bmatrix} \begin{bmatrix} G_{k-1} \\ G_{k-2} \end{bmatrix}. \end{aligned}$$
(34)

The Jordan decomposition of the matrix in the above expression is given by

$$\begin{aligned} \begin{bmatrix} \phi _1^2+\phi _2 &{} \phi _1 \phi _2 \\ \phi _1 &{} \phi _2 \end{bmatrix} = \begin{bmatrix} -\phi &{} \phi + \phi _1 \\ 1 &{} 1 \end{bmatrix} \begin{bmatrix} \phi ^2 &{} 0 \\ 0 &{} \rho ^2 \end{bmatrix} \begin{bmatrix} \frac{-1}{\phi _1+2 \phi } &{} \frac{1}{2}+ \frac{\phi _1}{2(\phi _1+2 \phi )} \\ \frac{1}{\phi _1+2 \phi } &{} \frac{1}{2}-\frac{\phi _1}{2(\phi _1+2 \phi )} \end{bmatrix}. \end{aligned}$$
(35)

Here we substituted \(\phi _2 = \phi (\phi +\phi _1)\) in the Jordan decomposition of Eq. (35). Next, we discuss the two possible cases for the parity of k.

Case 1 k even

$$\begin{aligned} \begin{bmatrix} G_{k+1} \\ G_{k} \end{bmatrix}&\overset{(34)}{\le } \begin{bmatrix} \phi _1^2+\phi ^2 + \phi \phi _1 &{} \phi ^2 \phi _1 + \phi \phi _1^2 \\ \phi _1 &{} \phi ^2 + \phi \phi _1 \end{bmatrix} \begin{bmatrix} G_{k-1} \\ G_{k-2} \end{bmatrix} \nonumber \\&\quad \vdots \nonumber \\&\le \ \begin{bmatrix} \phi _1^2+\phi ^2 + \phi \phi _1 &{} \phi ^2 \phi _1 + \phi \phi _1^2 \\ \phi _1 &{} \phi ^2 + \phi \phi _1 \end{bmatrix}^{\frac{k}{2}} \begin{bmatrix} G_{1} \\ G_{0} \end{bmatrix} \nonumber \\&\overset{(35) }{=} \begin{bmatrix} -\phi &{} \phi + \phi _1 \\ 1 &{} 1 \end{bmatrix} \begin{bmatrix} \phi ^k &{} 0 \\ 0 &{} \rho ^k \end{bmatrix} \begin{bmatrix} \frac{-1}{\phi _1+2 \phi } &{} \frac{1}{2}+ \frac{\phi _1}{2(\phi _1+2 \phi )} \\ \frac{1}{\phi _1+2 \phi } &{} \frac{1}{2}-\frac{\phi _1}{2(\phi _1+2 \phi )} \end{bmatrix} \begin{bmatrix} G_{0} \\ G_{0} \end{bmatrix} \nonumber \\&= \begin{bmatrix} (1+\phi ) \rho ^{k+1} + (1-\phi -\phi _1) \phi ^{k+1} \\ (1+\phi ) \rho ^{k} - (1-\phi -\phi _1) \phi ^{k} \end{bmatrix} \begin{bmatrix} \frac{G_{0}}{\phi _1+ 2 \phi } \end{bmatrix} \nonumber \\&\overset{(23) }{=} \begin{bmatrix} R_1 \rho ^{k+1}+ R_2 \phi ^{k+1} \\ R_1 \rho ^{k}- R_2 \phi ^{k} \end{bmatrix} \ G_0. \end{aligned}$$
(36)

Here, we used \(G_0 = G_1\).

Case 2 k odd

$$\begin{aligned} \begin{bmatrix} G_{k+1} \\ G_{k} \end{bmatrix}&\overset{(34)}{\le } \begin{bmatrix} \phi _1^2+\phi ^2 + \phi \phi _1 &{} \phi ^2 \phi _1 + \phi \phi _1^2 \\ \phi _1 &{} \phi ^2 + \phi \phi _1 \\ \end{bmatrix} \begin{bmatrix} G_{k-1} \\ G_{k-2} \end{bmatrix} \nonumber \\&\quad \vdots \nonumber \\&\le \ \begin{bmatrix} \phi _1^2+\phi ^2 + \phi \phi _1 &{} \phi ^2 \phi _1 + \phi \phi _1^2 \\ \phi _1 &{} \phi ^2 + \phi \phi _1 \end{bmatrix}^{\frac{k-1}{2}} \begin{bmatrix} G_{2} \\ G_{1} \end{bmatrix} \nonumber \\&\overset{(35)}{\le } \begin{bmatrix} -\phi &{} \phi + \phi _1 \\ 1 &{} 1 \end{bmatrix} \begin{bmatrix} \phi ^{k-1} &{} 0 \\ 0 &{} \rho ^{k-1} \end{bmatrix} \begin{bmatrix} \frac{-1}{\phi _1+2 \phi } &{} \frac{1}{2}+ \frac{\phi _1}{2(\phi _1+2 \phi )} \\ \frac{1}{\phi _1+2 \phi } &{} \frac{1}{2}-\frac{\phi _1}{2(\phi _1+2 \phi )} \end{bmatrix} \begin{bmatrix} (\phi _1+\phi _2)G_{0} \\ G_{0} \end{bmatrix} \nonumber \\&= \begin{bmatrix} (\phi +\phi _1+\phi _2)\rho ^{k} - (\phi -\phi _2) \phi ^{k} \\ (\phi +\phi _1+\phi _2)\rho ^{k-1} + (\phi -\phi _2) \phi ^{k-1} \end{bmatrix} \begin{bmatrix} \frac{G_{0}}{\phi _1+ 2 \phi } \end{bmatrix} \nonumber \\&\overset{(23)}{=} \begin{bmatrix} R_3 \rho ^{k}- R_4 \phi ^{k} \\ R_3 \rho ^{k-1} + R_4 \phi ^{k-1} \end{bmatrix} \ G_0. \end{aligned}$$
(37)

Here, we used the inequality \(G_2 \le \phi _1 G_1 + \phi _2 G_0\). Now, combining the relations from Eq. (36) and (37), we can prove the second part of Theorem 1. \(\square \)
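
As a quick numerical illustration (ours, not from the paper), the snippet below iterates the recurrence \(G_{k+1} = \phi _1 G_k + \phi _2 G_{k-1}\) with equality, i.e., the worst case allowed by the hypothesis, starting from \(G_0 = G_1\), and confirms the bound \(G_{k+1} \le (1+\phi )(\phi +\phi _1)^k G_0\) from the first part of the theorem; the values of \(\phi _1, \phi _2\) are arbitrary subject to \(\phi _1, \phi _2 \ge 0\) and \(\phi _1 + \phi _2 < 1\).

```python
import numpy as np

phi1, phi2 = 0.5, 0.3                                   # arbitrary, phi1, phi2 >= 0 and phi1 + phi2 < 1
phi = (-phi1 + np.sqrt(phi1**2 + 4 * phi2)) / 2.0       # largest root of t^2 + phi1*t - phi2 = 0
rho = phi + phi1                                        # < 1 whenever phi1 + phi2 < 1

G = [1.0, 1.0]                                          # G_0 = G_1, worst-case (equality) recurrence
for k in range(1, 60):
    G.append(phi1 * G[k] + phi2 * G[k - 1])             # G_{k+1}
    bound = (1 + phi) * rho**k * G[0]                   # first part of Theorem 1
    assert G[k + 1] <= bound + 1e-12, (k, G[k + 1], bound)
print("bound holds for k = 1,...,59; rho =", rho)
```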

Proof of Theorem 2

From the given recurrence relation, we have

$$\begin{aligned} \begin{bmatrix} H_{k+1} \\ F_{k+1} \end{bmatrix}&\le \begin{bmatrix} \varPi _1 &{} \varPi _2 \\ \varPi _3 &{} \ \varPi _4 \end{bmatrix} \begin{bmatrix} H_{k} \\ F_{k} \end{bmatrix} \le \begin{bmatrix} \varPi _1 &{} \varPi _2 \\ \varPi _3 &{} \ \varPi _4 \end{bmatrix}^k \begin{bmatrix} H_{1} \\ F_{1} \end{bmatrix}. \end{aligned}$$
(38)

Using the definitions of (26), we can write the Jordan decomposition of the above matrix as follows:

$$\begin{aligned} \begin{bmatrix} \varPi _1 &{} \varPi _2 \\ \varPi _3 &{} \ \varPi _4 \end{bmatrix} = \begin{bmatrix} \varGamma _2 &{} \varGamma _1 \\ 1 &{} \ 1 \end{bmatrix} \begin{bmatrix} \rho _1 &{} 0 \\ 0 &{} \rho _2 \end{bmatrix} \begin{bmatrix} -\varGamma _3 &{} \varGamma _1 \varGamma _3 \\ \varGamma _3 &{} \ -\varGamma _2 \varGamma _3 \end{bmatrix}. \end{aligned}$$
(39)

Now, substituting the matrix decomposition into Eq. (38) and simplifying, we have

$$\begin{aligned} \begin{bmatrix} H_{k+1} \\ F_{k+1} \end{bmatrix}&\le \ \begin{bmatrix} \varPi _1 &{} \varPi _2 \\ \varPi _3 &{} \ \varPi _4 \end{bmatrix}^k \begin{bmatrix} H_{1} \\ F_{1} \end{bmatrix} \overset{(39) }{=} \begin{bmatrix} \varGamma _2 &{} \varGamma _1 \\ 1 &{} \ 1 \end{bmatrix} \begin{bmatrix} \rho _1^k &{} 0 \\ 0 &{} \rho _2^k \end{bmatrix} \begin{bmatrix} -\varGamma _3 &{} \varGamma _1 \varGamma _3 \\ \varGamma _3 &{} \ -\varGamma _2 \varGamma _3 \end{bmatrix} \begin{bmatrix} H_{1} \\ F_1 \end{bmatrix} \nonumber \\&= \begin{bmatrix} \varGamma _3 (\varGamma _1 \rho _2^{k} -\varGamma _2 \rho _1^{k}) \ \ &{} \ \ \varGamma _1 \varGamma _2 \varGamma _3 (\rho _1^{k}- \rho _2^{k}) \\ \varGamma _3 (\rho _2^{k}- \rho _1^{k}) \ &{} \ \varGamma _3 (\varGamma _1 \rho _1^{k} -\varGamma _2 \rho _2^{k}) \end{bmatrix} \ \begin{bmatrix} H_{1} \\ F_1 \end{bmatrix}. \end{aligned}$$
(40)

Since \(\varPi _1, \varPi _2, \varPi _3, \varPi _4 \ge 0\), one can easily verify that \(\varGamma _1, \varGamma _3 \ge 0\). Now, it remains to show that \( 0 \le \rho _1 \le \rho _2 < 1\). To show that, first note that

$$\begin{aligned} (\varPi _1 - \varPi _4)^2 + 4\varPi _2 \varPi _3&\overset{(25) }{<} (\varPi _1 - \varPi _4)^2 + 4 - 4 \varPi _1 \varPi _4 - 4(\varPi _1 + \varPi _4) \nonumber \\&= \left( 2-\varPi _1 - \varPi _4\right) ^2. \end{aligned}$$
(41)

Now we have,

$$\begin{aligned} \rho _1&= \frac{1}{2} \left[ \varPi _1+\varPi _4 - \sqrt{(\varPi _1-\varPi _4)^2+4\varPi _2\varPi _3}\right] \\&\ge \frac{1}{2} \left[ \varPi _1+\varPi _4 - \sqrt{(\varPi _1-\varPi _4)^2+4\varPi _1\varPi _4}\right] = \frac{1}{2} \left[ \varPi _1+\varPi _4 - (\varPi _1+\varPi _4)\right] =0. \end{aligned}$$

Moreover, since \(2-\varPi _1 - \varPi _4 \ge 0 \), we have

$$\begin{aligned} \rho _1 \le \ \rho _2&= \frac{1}{2} \left[ \varPi _1+\varPi _4 + \sqrt{(\varPi _1-\varPi _4)^2+4\varPi _2\varPi _3}\right] \\&\overset{(41) }{<} \frac{1}{2} \left[ \varPi _1+\varPi _4 + \sqrt{ \left( 2-\varPi _1 - \varPi _4\right) ^2 }\right] \\&\overset{(25) }{=} \frac{1}{2} \left[ \varPi _1+\varPi _4 + 2-\varPi _1 - \varPi _4 \right] = 1. \end{aligned}$$

As \(0 \le \rho _1 \le \rho _2 < 1\), considering (40) we can deduce that the sequences \(\{H_k\}\) and \( \{F_k\}\) converge. \(\square \)
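
The conclusion can also be checked numerically. The snippet below (illustrative only) takes nonnegative \(\varPi _1,\ldots ,\varPi _4\) satisfying condition (25), computes \(\rho _1, \rho _2\) by the quadratic formula used above, verifies \(0 \le \rho _1 \le \rho _2 < 1\), and iterates the matrix recurrence to confirm that both sequences decay; the numerical values are arbitrary.

```python
import numpy as np

P1, P2, P3, P4 = 0.7, 0.2, 0.3, 0.5                     # arbitrary nonnegative entries

# Condition (25): Pi2*Pi3 <= Pi1*Pi4 and Pi1 + Pi4 < 1 + min{1, Pi1*Pi4 - Pi2*Pi3}.
assert P2 * P3 <= P1 * P4
assert P1 + P4 < 1 + min(1.0, P1 * P4 - P2 * P3)

disc = np.sqrt((P1 - P4) ** 2 + 4 * P2 * P3)
rho1, rho2 = (P1 + P4 - disc) / 2.0, (P1 + P4 + disc) / 2.0
assert 0 <= rho1 <= rho2 < 1                            # conclusion of Theorem 2
print("rho1 =", rho1, " rho2 =", rho2)

# The same numbers drive the coupled recurrence [H_{k+1}; F_{k+1}] <= M [H_k; F_k].
M = np.array([[P1, P2], [P3, P4]])
HF = np.array([1.0, 1.0])
for _ in range(200):
    HF = M @ HF
print("H, F after 200 steps:", HF)                      # both decay at the geometric rate rho2
```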

Appendix 2

Proof of Theorem 3

From the update formula of Algorithm 2, we have \(z_{k} = x_k - (A_{\tau _k}x_k - b_{\tau _k})^+_{i^*} a_{i^*}\) where,

$$\begin{aligned} i^* = {\mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{i \in \tau _k}} \{a_i^Tx_k-b_i, 0\} \ = \ {\mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{i \in \tau _k}} (A_{\tau _k}x_k-b_{\tau _k})^+_i. \end{aligned}$$
(42)

Similarly, the previous update formula can be written as, \(z_{k-1} = x_{k-1} - (A_{\tau _{k-1}}x_{k-1} - b_{\tau _{k-1}})^+_{j^*} a_{j^*}\); where

$$\begin{aligned} j^* = {\mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{j \in \tau _{k-1}}} \{a_j^Tx_{k-1}-b_j, 0\} \ = \ {\mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{j \in \tau _{k-1}}} (A_{\tau _{k-1}}x_{k-1}-b_{\tau _{k-1}})^+_j. \end{aligned}$$
(43)

Note that the notation is consistent with the definition in (6). Since, for any \(\xi \in Q_1\), \( (1-\xi ) {\mathcal {P}}(x_{k}) + \xi {\mathcal {P}}(x_{k-1}) \in P\), we have

$$\begin{aligned} d(x_{k+1},P)^2&= \ \ \big \Vert x_{k+1}- {\mathcal {P}}(x_{k+1}) \big \Vert ^2 \nonumber \\&\overset{\text {Lemma} \ 3}{ \le } \ \big \Vert x_{k+1}- (1-\xi ) {\mathcal {P}}(x_{k}) - \xi {\mathcal {P}}(x_{k-1})\big \Vert ^2 \nonumber \\&\overset{(10)}{=} \big \Vert (1-\xi ) z_k + \xi z_{k-1}-(1-\xi ) {\mathcal {P}}(x_{k}) - \xi {\mathcal {P}}(x_{k-1}) \big \Vert ^2 \nonumber \\&\overset{(11)}{=} \big \Vert (1-\xi ) \left\{ x_k- {\mathcal {P}}(x_{k})- \delta \left( a_{i^*}^Tx_{k}-b_{i^*}\right) ^+ a_{i^*}\right\} \nonumber \\&\quad + \xi \left\{ x_{k-1}- {\mathcal {P}}(x_{k-1})- \delta \left( a_{j^*}^Tx_{k-1}-b_{j^*}\right) ^+ a_{j^*}\right\} \big \Vert ^2 \nonumber \\&\le (1-\xi ) \ \big \Vert x_k- {\mathcal {P}}(x_{k})- \delta \left( a_{i^*}^Tx_{k}-b_{i^*}\right) ^+ a_{i^*} \big \Vert ^2 \nonumber \\&\quad + \xi \ \big \Vert x_{k-1}- {\mathcal {P}}(x_{k-1})- \delta \left( a_{j^*}^Tx_{k-1}-b_{j^*}\right) ^+ a_{j^*} \big \Vert ^2. \end{aligned}$$
(44)

We used the fact that the function \(\Vert \cdot \Vert ^2\) is convex and \(0 \le \xi \le 1\). Now, taking expectation on both sides of Eq. (44) and using Lemma 9, we get the following:

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}[d(x_{k+1},P)^2 \ |&\ {\mathbb {S}}_{k}, {\mathbb {S}}_{k-1}] \le (1-\xi ) \ {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_k} \left[ \big \Vert x_k- {\mathcal {P}}(x_{k})- \delta \left( a_{i^*}^Tx_{k}-b_{i^*}\right) ^+ a_{i^*} \big \Vert ^2 \right] \nonumber \\&+ \xi \ {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_{k-1}} \left[ \big \Vert x_{k-1}- {\mathcal {P}}(x_{k-1})- \delta \left( a_{j^*}^Tx_{k-1}-b_{j^*}\right) ^+ a_{j^*} \big \Vert ^2\right] \nonumber \\&\overset{\text {Lemma} \ 9}{\le } (1-\xi ) \ h(\delta ) \ d(x_k,P)^2 + \xi \ h(\delta ) \ d(x_{k-1}, P)^2, \end{aligned}$$
(45)

where \(h(\delta )\) is defined in Lemma 9. Taking expectation again in Eq. (45) and letting \(G_{k+1} = {{\,\mathrm{\mathbb {E}}\,}}\left[ d(x_{k+1},P)^2\right] \), we get the following:

$$\begin{aligned} G_{k+1} \le \phi _1 G_{k} + \phi _2 G_{k-1}. \end{aligned}$$
(46)

Since \(\phi _1 , \phi _2 \ge 0\), \(\phi _1+ \phi _2 < 1\) and \(z_0 = z_1\), using the first part of Theorem 1, we have the following:

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}\left[ d(x_{k+1},P)^2\right] \le (1+\phi ) (\phi + \phi _1)^k G_0 = (1+\phi ) \rho ^k {{\,\mathrm{\mathbb {E}}\,}}\left[ d(x_{0},P)^2\right] . \end{aligned}$$
(47)

Moreover, combining (47) with Lemma 6, we get the bound on \({{\,\mathrm{\mathbb {E}}\,}}[f(x_k)]\), which proves the first part of Theorem 3. Furthermore, using the second part of Theorem 1 and Eq. (46), we get the second part of Theorem 3. Now, to prove the third part, first note that \(\frac{1}{k} \sum \limits _{l=1}^{k} {\mathcal {P}}(x_l) \in P\); using Lemma 3 we have

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}[d(\tilde{x}_k,P)^2]&= {{\,\mathrm{\mathbb {E}}\,}}[\Vert \tilde{x}_k- {\mathcal {P}}(\tilde{x}_k)\Vert ^2] \overset{\text {Lemma} \ 3}{ \le } {{\,\mathrm{\mathbb {E}}\,}}\left[ \Big \Vert \frac{1}{k} \sum \limits _{l=1}^{k} \left( x_l-{\mathcal {P}}(x_l)\right) \Big \Vert ^2\right] \nonumber \\&\le {{\,\mathrm{\mathbb {E}}\,}}\left[ \frac{1}{k} \sum \limits _{l=1}^{k} \big \Vert x_l-{\mathcal {P}}(x_l)\big \Vert ^2\right] = \frac{1}{k} \sum \limits _{l=1}^{k} {{\,\mathrm{\mathbb {E}}\,}}[d(x_l,P)^2] \nonumber \\&\le \frac{(1+\phi ) d(x_0,P)^2}{k} \sum \limits _{l=1}^{k} \rho ^{l-1} \le \frac{(1+\phi ) d(x_0,P)^2}{ k(1-\rho )}. \end{aligned}$$
(48)

Here we use the geometric series bound along with the fact that \(0< \rho < 1\). Furthermore, using a more simplified version of (44) we have the following:

$$\begin{aligned} G_{l+1} - G_l \le \xi (G_l-G_{l-1})+ 2 \xi \delta (2-\delta ) [f(x_l)-f(x_{l-1})] - 2 \delta (2-\delta ) f(x_{l}), \end{aligned}$$

for any \(l \ge 1\). Summing up the above inequality for \(l =1,2,\ldots ,k\), we have the following:

$$\begin{aligned} 2 \delta (2-\delta ) \sum \limits _{l=1}^{k} f(x_l)&\le \xi G_0 + G_1- \xi G_k- G_{k+1}+ 2 \xi \delta (2-\delta ) [f(x_k)-f(x_{0})] \nonumber \\&\le (1+\xi ) G_0 + \xi [2 \eta f(x_k)-G_k] \nonumber \\&\le (1+\xi ) G_0 + \xi [\eta \mu _2 -1] G_k \le (1+\xi ) d(x_0,P)^2, \end{aligned}$$
(49)

where \(\eta = 2\delta - \delta ^2 \le 1\). Also, from Lemma 6, we get \(\mu _2 = \min \{1, \frac{\beta }{m} \lambda _{\max }\} \le 1\). This implies that \(\eta \mu _2 \le 1\) holds. Moreover, we used the non-negativity of the sequences \(G_k\) and \( f(x_k)\). We also used the upper bound from Lemma 6. Then we get

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}[f(\tilde{x}_k)]&\le {{\,\mathrm{\mathbb {E}}\,}}\left[ \frac{1}{k} \sum \limits _{l=1}^{k} f(x_l)\right] = \frac{1}{k} \sum \limits _{l=1}^{k} {{\,\mathrm{\mathbb {E}}\,}}[f(x_l)] \le \frac{(1+\xi ) \ d(x_0,P)^2}{2 \delta k (2-\delta )}. \end{aligned}$$

This proves the third part of Theorem 3. \(\square \)

Proof of Theorem 5

For any natural number \(l \ge 1\) define, \(\vartheta _l = \frac{\xi }{1+\xi }[x_{l-1}-x_l-\delta (a_{j^*}^Tx_{l-1}-b_{j^*})^+a_{j^*}]\), \( \ \varDelta _l = x_l + \vartheta _l\) and \(\chi _l = \Vert x_l+\vartheta _l-{\mathcal {P}}(\varDelta _l)\Vert ^2\), then using the update formulas (10) and (11), we have

$$ \begin{aligned} x_{l+1} + \vartheta _{l+1} \overset{(10) \ \& \ (11)}{=} x_l + \vartheta _l - \frac{\delta }{1+\xi } \left( a_{i^*}^Tx_l-b_{i^*}\right) ^+ a_{i^*}, \end{aligned}$$

here, the indices \(i^{*}\) and \(j^*\) are defined based on (6) for the iterates \(x_l\) and \(x_{l-1}\), respectively. Using the above relation, we can write

$$\begin{aligned} \chi _{l+1}&= \Vert x_{l+1}+\vartheta _{l+1}-{\mathcal {P}}(\varDelta _{l+1})\Vert ^2 \overset{\text {Lemma} \ 3}{ \le } \Vert x_{l+1}+\vartheta _{l+1}-{\mathcal {P}}(\varDelta _l)\Vert ^2 \nonumber \\&= \big \Vert x_l + \vartheta _l - \frac{\delta }{1+\xi } \left( a_{i^*}^Tx_l-b_{i^*}\right) ^+ a_{i^*}-{\mathcal {P}}(\varDelta _l) \big \Vert ^2 \nonumber \\&= \underbrace{\Vert x_l+\vartheta _l-{\mathcal {P}}(\varDelta _l)\Vert ^2}_{= \chi _l} + \frac{\delta ^2}{(1+\xi )^2} \underbrace{\Vert (a_{i^*}^Tx_l- b_{i^*})^+a_{i^*}\Vert ^2}_{J_1} \nonumber \\&\qquad - \frac{2 \delta }{1+\xi } \underbrace{\big \langle x_l+\vartheta _l-{\mathcal {P}}(\varDelta _l) \ ,\ a_{i^*} (a_{i^*}^Tx_l-b_{i^{*}})^{+} \big \rangle }_{J_2} \nonumber \\&= \chi _l + \frac{\delta ^2}{(1+\xi )^2} J_1- \frac{2 \delta }{1+\xi } J_2. \end{aligned}$$
(50)

Taking expectation with respect to \({\mathbb {S}}_l\) we have,

$$\begin{aligned} \frac{\delta ^2}{(1+\xi )^2} {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_l} [J_1] \overset{(9) }{=} \frac{2\delta ^2}{(1+\xi )^2} f(x_l). \end{aligned}$$
(51)

Similarly, we can simplify the third term of (50) as

$$ \begin{aligned}&- \frac{2\delta }{1+\xi } {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_l} [J_2] \nonumber \\&\quad \overset{(9) }{=} -\frac{2\delta }{1+\xi } \big \langle x_l-{\mathcal {P}}(\varDelta _l) , g(x_l) \big \rangle - \frac{2\delta \xi }{(1+\xi )^2} \big \langle x_{l-1}-x_l - \delta g(x_{l-1}) , g(x_l) \big \rangle \nonumber \\&\quad = -\frac{2\delta }{1+\xi } \big \langle x_l-{\mathcal {P}}(\varDelta _l) , g(x_l) \big \rangle - \frac{2\delta \xi }{(1+\xi )^2} \big \langle x_{l-1}-x_l , g(x_l) \big \rangle \nonumber \\&\quad \quad \quad \quad \quad +\frac{\delta ^2 \xi }{(1+\xi )^2} \left[ \Vert g(x_l)+g(x_{l-1})\Vert ^2-\Vert g(x_l)\Vert ^2-\Vert g(x_{l-1})\Vert ^2 \right] \nonumber \\&\quad \overset{\text {Lemma} \ 7 \ \& \ 8}{\le } - \frac{4\delta }{1+\xi } f(x_l) - \frac{2\delta \xi }{(1+\xi )^2} \left[ f(x_{l-1}) - f(x_{l}) \right] - \frac{2\delta ^2 \xi }{(1+\xi )^2} \left[ f(x_{l-1}) + f(x_{l}) \right] \nonumber \\&\quad = -\frac{2 \delta \xi (1+\delta )}{(1+\xi )^2} f(x_{l-1}) +\frac{2 \delta \xi (1+\delta )}{(1+\xi )^2} f(x_{l}) - \frac{4 \delta (1+\xi + \delta \xi )}{(1+\xi )^2} f(x_{l}). \end{aligned}$$
(52)

Using the expressions of Eqs. (51) and (52) in (50) and simplifying further, we have

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_l}[\chi _{l+1}] - \frac{2\delta \xi (1+\delta )}{(1+\xi )^2} f(x_l) + \varpi f(x_l) \ \le \ \chi _l - \frac{2\delta \xi (1+\delta )}{(1+\xi )^2} f(x_{l-1}), \end{aligned}$$
(53)

here, the constant \(\varpi \) is defined as

$$\begin{aligned} \varpi = \frac{4 \delta (1+\xi + \delta \xi )}{(1+\xi )^2} - \frac{2 \delta ^2 }{(1+\xi )^2} = \frac{2 \delta (2+2 \xi + 2\delta \xi -\delta )}{(1+\xi )^2} \ > \ 0. \end{aligned}$$
(54)

Now, taking expectation again in (53) and using the tower property, we get

$$\begin{aligned} q_{l+1} + \varpi {{\,\mathrm{\mathbb {E}}\,}}[f(x_l)] \le q_l, \quad l = 1,2,3\ldots , \end{aligned}$$
(55)

where \(q_l = {{\,\mathrm{\mathbb {E}}\,}}[\chi _l] - \frac{2\delta \xi (1+\delta )}{(1+\xi )^2} {{\,\mathrm{\mathbb {E}}\,}}[f(x_{l-1})]\). Summing up (55) for \(l=1,2,\ldots ,k\) we get

$$\begin{aligned} \sum \limits _{l=1}^{k} {{\,\mathrm{\mathbb {E}}\,}}[f(x_l)] \ \le \ \frac{q_1-q_{k+1}}{\varpi } \ \le \ \frac{q_1}{\varpi }. \end{aligned}$$
(56)

Now, using Jensen’s inequality, we have

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}\left[ f(\bar{x_k})\right] = {{\,\mathrm{\mathbb {E}}\,}}\left[ f\left( \sum \limits _{l=1}^{k} \frac{x_l}{k}\right) \right] \ \le \ {{\,\mathrm{\mathbb {E}}\,}}\left[ \frac{1}{k} \sum \limits _{l=1}^{k}f(x_l)\right] \ = \ \frac{1}{k} \sum \limits _{l=1}^{k} {{\,\mathrm{\mathbb {E}}\,}}[f(x_l)] \ \overset{(56)}{\le } \frac{q_1}{\varpi k}. \end{aligned}$$

Since, \(x_0=x_1\), we have \(\vartheta _1 = \frac{-\delta \xi }{1+\xi } (a_{i^*}^Tx_0-b_{i^*})^+ a_{i^*} \). Furthermore,

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}[\chi _1]&= {{\,\mathrm{\mathbb {E}}\,}}\left[ \Vert x_1+\vartheta _1 -{\mathcal {P}}(\varDelta _1)\Vert ^2 \right] \overset{\text {Lemma} \ 3}{ \le } {{\,\mathrm{\mathbb {E}}\,}}\left[ \Vert x_1+\vartheta _1 -{\mathcal {P}}(x_0)\Vert ^2 \right] \nonumber \\&= {{\,\mathrm{\mathbb {E}}\,}}\left[ \Vert x_0 -{\mathcal {P}}(x_0)-\frac{\delta \xi }{1+\xi } (a_{i^*}^Tx_0-b_{i^*})^+ a_{i^*} \Vert ^2 \right] \nonumber \\&= \Vert x_0 -{\mathcal {P}}(x_0)\Vert ^2 + \frac{\delta ^2\xi ^2}{(1+\xi )^2} {{\,\mathrm{\mathbb {E}}\,}}[|(a_{i^*}^Tx_0-b_{i^*})^+|^2] \nonumber \\&\quad \quad \quad \quad \quad \quad - \frac{2\delta \xi }{1+\xi } \langle x_0 -{\mathcal {P}}(x_0) , {{\,\mathrm{\mathbb {E}}\,}}[(a_{i^*}^Tx_0-b_{i^*})^+ a_{i^*}] \rangle \nonumber \\&\overset{\text {Lemma} \ 6 }{\le } \Vert x_0 -{\mathcal {P}}(x_0)\Vert ^2 + \frac{2\delta ^2\xi ^2}{(1+\xi )^2} f(x_0)-\frac{2\delta \xi \mu _2}{1+\xi } \Vert x_0 -{\mathcal {P}}(x_0)\Vert ^2. \end{aligned}$$
(57)

Now, from our construction we get

$$\begin{aligned} q_1 = {{\,\mathrm{\mathbb {E}}\,}}[\chi _1] - \frac{2\delta \xi (1+\delta )}{(1+\xi )^2} {{\,\mathrm{\mathbb {E}}\,}}[f(x_{0})] \le (1-\frac{2\delta \xi \mu _2}{1+\xi }) \ d(x_0,P)^2+ \frac{2\delta \xi (\delta \xi -1-\delta )}{(1+\xi )^2} f(x_0). \end{aligned}$$

Substituting the values of \(\varpi \) and \(q_1\) in the expression of \({{\,\mathrm{\mathbb {E}}\,}}\left[ f(\bar{x_k})\right] \), we have the following:

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}\left[ f({\bar{x}}_k)\right] \le \frac{ (1+\xi ) (1+\xi -2\delta \xi \mu _2) \ d(x_0,P)^2+ 2 \xi \delta (\delta \xi -\delta -1) f(x_0)}{2 \delta k \left( 2+2 \xi + 2 \delta \xi -\delta \right) }. \end{aligned}$$

\(\square \)

Proof of Theorem 4

Since the term \(\Vert x_{k+1}- {\mathcal {P}}(x_{k+1}) \Vert \) is constant under the expectation with respect to \({\mathbb {S}}_{k+1}\), from the update formula of the GSKM algorithm, we get

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}[ \Vert x_{k+1}- {\mathcal {P}}(x_{k+1})&\Vert \ | \ {\mathbb {S}}_{k+1}, {\mathbb {S}}_k] = {{\,\mathrm{\mathbb {E}}\,}}[ \Vert x_{k+1}- {\mathcal {P}}(x_{k+1}) \Vert \ | \ {\mathbb {S}}_k] \nonumber \\&\overset{\text {Lemma} \ 3}{ \le }\ \ {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_k} [\Vert x_{k+1}- {\mathcal {P}}(x_{k}) \Vert ] \nonumber \\&= {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_k} [\Vert z_k-{\mathcal {P}}(x_k)- \xi (z_k-z_{k-1}) \Vert ] \nonumber \\&\le {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_k}[\Vert z_k-{\mathcal {P}}(x_k)\Vert ] + |\xi | {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_k} [\Vert z_k-z_{k-1}\Vert ] \nonumber \\&\le \left\{ {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_k}[ \Vert z_k-{\mathcal {P}}(x_k)\Vert ^2]\right\} ^{\frac{1}{2}} + |\xi | \Vert z_k-z_{k-1}\Vert \nonumber \\&\overset{\text {Lemma} \ 9 }{\le } \sqrt{h(\delta )} \ \Vert x_k-{\mathcal {P}}(x_k)\Vert + |\xi | \Vert z_k-z_{k-1}\Vert . \end{aligned}$$
(58)

We performed the two expectations in order, from the innermost to the outermost. Now, taking expectation in (58) and using the tower property of expectation, we have,

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}[\Vert x_{k+1}-{\mathcal {P}}(x_{k+1})\Vert ]&\le \sqrt{h(\delta )} \ {{\,\mathrm{\mathbb {E}}\,}}[\Vert x_k-{\mathcal {P}}(x_k)\Vert ] + |\xi | \ {{\,\mathrm{\mathbb {E}}\,}}[\Vert z_k-z_{k-1}\Vert ]. \end{aligned}$$
(59)

Similarly, using the update formula for \(z_{k+1}\), we have

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}[\Vert z_{k+1}&-z_k \Vert \ | \ {\mathbb {S}}_{k+1}, {\mathbb {S}}_k] = {{\,\mathrm{\mathbb {E}}\,}}[{{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_{k+1}}[\Vert x_{k+1} - \delta \left( a_{i^*}^Tx_{k+1}-b_{i^*}\right) ^+ a_{i^*}-z_k \Vert ] \ | \ {\mathbb {S}}_k] \nonumber \\&= {{\,\mathrm{\mathbb {E}}\,}}[{{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_{k+1}} [\Vert -\xi (z_k-z_{k-1}) - \delta \left( a_{i^*}^Tx_{k+1}-b_{i^*}\right) ^+ a_{i^*}\Vert ] \ | \ {\mathbb {S}}_k] \nonumber \\&\le |\xi | \ \Vert z_k-z_{k-1}\Vert + \delta {{\,\mathrm{\mathbb {E}}\,}}[{{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_{k+1}}[|(a_{i^*}^Tx_{k+1}-b_{i^*})^+ |] \ | \ {\mathbb {S}}_k] \nonumber \\&\le |\xi | \ \Vert z_k-z_{k-1}\Vert + \delta {{\,\mathrm{\mathbb {E}}\,}}[\left\{ {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_{k+1}}[|(a_{i^*}^Tx_{k+1}-b_{i^*})^+ |^2]\right\} ^{\frac{1}{2}} \ | \ {\mathbb {S}}_k] \nonumber \\&\overset{\text {Lemma} \ 6 }{\le } |\xi | \ \Vert z_k-z_{k-1}\Vert + \delta \sqrt{\mu _2} \ {{\,\mathrm{\mathbb {E}}\,}}[\Vert x_{k+1}-{\mathcal {P}}(x_{k+1})\Vert \ | \ {\mathbb {S}}_k]. \end{aligned}$$
(60)

Taking expectation in (60) and using (59) along with the tower property, we have

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}[\Vert&z_{k+1}- z_{k}\Vert ] \ \le \ |\xi | \ {{\,\mathrm{\mathbb {E}}\,}}[\Vert z_k-z_{k-1}\Vert ] + \delta \sqrt{\mu _2} \ {{\,\mathrm{\mathbb {E}}\,}}[\Vert x_{k+1}-{\mathcal {P}}(x_{k+1})\Vert ] \nonumber \\&\overset{(59)}{\le } |\xi | \left( 1+ \delta \sqrt{\mu _2}\right) {{\,\mathrm{\mathbb {E}}\,}}[\Vert z_k-z_{k-1}\Vert ] + \delta \sqrt{\mu _2 h(\delta )} {{\,\mathrm{\mathbb {E}}\,}}[\Vert x_k-{\mathcal {P}}(x_k)\Vert ]. \end{aligned}$$
(61)

Combining both (59) and (61), we can deduce the following matrix inequality:

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}\begin{bmatrix} \Vert x_{k+1}-{\mathcal {P}}(x_{k+1})\Vert \\ \Vert z_{k+1}-z_k\Vert \end{bmatrix}&\le \begin{bmatrix} \sqrt{h(\delta )} &{} |\xi | \\ \delta \sqrt{\mu _2 h(\delta )} &{} \ |\xi | \left( 1+ \delta \sqrt{\mu _2}\right) \end{bmatrix} {{\,\mathrm{\mathbb {E}}\,}}\begin{bmatrix} \Vert x_k-{\mathcal {P}}(x_k)\Vert \\ \Vert z_k-z_{k-1}\Vert \end{bmatrix}. \end{aligned}$$
(62)

Now, from the definition, it can be easily checked that \(\varPi _1, \varPi _2, \varPi _3, \varPi _4 \ge 0\). Since, \(\xi \in Q_2\), we have

$$\begin{aligned} \varPi _2 \varPi _3 - \varPi _1 \varPi _4&= |\xi | \delta \sqrt{\mu _2 h(\delta )} - |\xi | \sqrt{h(\delta )} - |\xi | \delta \sqrt{\mu _2 h(\delta )} = - |\xi | \sqrt{h(\delta )} \le 0. \end{aligned}$$
(63)

Also, we have

$$\begin{aligned} \varPi _1 + \varPi _4- \varPi _1 \varPi _4+&\varPi _2 \varPi _3 = \sqrt{h(\delta )} + |\xi | \left( 1+ \delta \sqrt{\mu _2}\right) - |\xi | \sqrt{h(\delta )} < 1. \end{aligned}$$
(64)

Here, in the last inequality we used the given condition. Considering (64), we can check that \(\varPi _1 + \varPi _4 < 1+|\xi | \sqrt{h(\delta )} = 1+ \min \{1, |\xi | \sqrt{h(\delta )}\} = 1+ \min \{1,\varPi _1 \varPi _4-\varPi _2 \varPi _3\}\). Also from (63), we have \(\varPi _2 \varPi _3 - \varPi _1 \varPi _4 \le 0\), which is precisely the condition provided in (25). Let’s define the sequences \(F_k = {{\,\mathrm{\mathbb {E}}\,}}[\Vert z_k-z_{k-1}\Vert ]\) and \(H_k = {{\,\mathrm{\mathbb {E}}\,}}[\Vert x_k-{\mathcal {P}}(x_k)\Vert ]\). Now, using Theorem 2, we have

$$\begin{aligned} \begin{bmatrix} H_{k+1} \\ F_{k+1} \end{bmatrix}&\le \begin{bmatrix} \varGamma _3 (\varGamma _1 \rho _2^{k} -\varGamma _2 \rho _1^{k}) &{} \varGamma _1 \varGamma _2 \varGamma _3 (\rho _1^{k}- \rho _2^{k}) \\ \varGamma _3 (\rho _2^{k}- \rho _1^{k}) &{} \varGamma _3 (\varGamma _1 \rho _1^{k} -\varGamma _2 \rho _2^{k}) \end{bmatrix} \ \begin{bmatrix} H_{1} \\ F_1 \end{bmatrix}. \end{aligned}$$
(65)

where \(\varGamma _1 , \varGamma _2, \varGamma _3, \rho _1, \rho _2\) can be derived from (26) using the parameter choice of (28). Note that from the GSKM algorithm we have \(x_1 = x_0\) and \(z_1 = z_0\). Therefore, we can easily check that \(F_1 = {{\,\mathrm{\mathbb {E}}\,}}[\Vert z_1-z_{0}\Vert ] = 0\) and \(H_1 = {{\,\mathrm{\mathbb {E}}\,}}[\Vert x_1-{\mathcal {P}}(x_1)\Vert ] = {{\,\mathrm{\mathbb {E}}\,}}[\Vert x_0-{\mathcal {P}}(x_0)\Vert ] = \Vert x_0-{\mathcal {P}}(x_0)\Vert = H_0 \). Now, substituting the values of \(H_1\) and \(F_1\) in (65), we have

$$\begin{aligned} \begin{bmatrix} H_{k+1} \\ F_{k+1} \end{bmatrix} = {{\,\mathrm{\mathbb {E}}\,}}\begin{bmatrix} d(x_{k+1}, P) \\ \Vert z_{k+1}-z_k\Vert \end{bmatrix} \le \begin{bmatrix} -\varGamma _2 \varGamma _3 \ \rho _1^{k}+ \varGamma _1 \varGamma _3 \ \rho _2^{k} \\ - \varGamma _3 \ \rho _1^{k}+ \varGamma _3 \ \rho _2^{k} \end{bmatrix} \ d(x_0,P). \end{aligned}$$
(66)

Also from Theorem 2 we have, \(\varGamma _1, \varGamma _3 \ge 0\) and \( 0 \le \rho _1 \le \rho _2 < 1\), which proves the Theorem. \(\square \)

Proof of Theorem 6

Note that if \(Ax \le b\) is feasible, then from Lemma 12 we know that there is a feasible solution \(x^*\) with \(|x^{*}_j| \le \frac{2^{\sigma }}{2n}\) for \(j = 1, \ldots , n\). Thus, we have

$$\begin{aligned} d(x_0,P) = \Vert x_0-{\mathcal {P}}(x_0)\Vert \ \le \ \Vert x^*\Vert \ \le \ \frac{2^{\sigma -1}}{\sqrt{n}}, \end{aligned}$$
(67)

as \(x_0 = 0\). On the other hand, if the system \(Ax \le b\) is infeasible, then by Lemma 10 we have

$$\begin{aligned} \theta (x) \ \ge \ 2^{1-\sigma }. \end{aligned}$$

This implies that when GSKM runs on the system \(Ax \le b\), the system can be certified feasible as soon as \(\theta (x) < 2^{1-\sigma }\). Furthermore, since every point of the feasible region P is inside the half-space defined by \(\tilde{H}_i = \{x \ | \ a_i^T x \le b_i\}\) for all \(i = 1,2,\ldots ,m\), we have the following:

$$\begin{aligned} \theta (x) \ = \ \left[ \max _{i}\{a_i^Tx-b_i\}\right] ^{+} \ \le \ \Vert a_i^T(x-{\mathcal {P}}(x))\Vert \ \le \ d(x,P). \end{aligned}$$
(68)

Then, for \(\xi \in Q_1\) whenever the system \(Ax \le b\) is feasible, we have

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}\left[ \theta (x_k)\right] \overset{(68)}{\le } {{\,\mathrm{\mathbb {E}}\,}}\left[ d(x_{k},P)\right]&\le \sqrt{{{\,\mathrm{\mathbb {E}}\,}}\left[ d(x_{k},P)^2\right] } \overset{\text {Theorem} \ 3}{\le } \sqrt{1+\phi } \rho ^{\frac{k-1}{2}} \ d(x_0,P). \end{aligned}$$
(69)

Similarly, for \(\xi \in Q_2\), whenever the system \(Ax \le b\) is feasible, we have (see Note 13)

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}\left[ \theta (x_k)\right] \overset{(68)}{\le } \ {{\,\mathrm{\mathbb {E}}\,}}\left[ d(x_{k},P)\right]&\overset{\text {Theorem} \ 4}{\le } \sqrt{1+\phi } \ \rho _2^{k-1} \ d(x_0,P). \end{aligned}$$
(70)

Take \({\bar{\rho }} = \max \{\rho , \rho _2^2\}\). Now, combining (69) and (70), for any \(\xi \in Q = Q_1 \cup Q_2\), whenever the system \(Ax \le b\) is feasible, we have

$$ \begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}\left[ \theta (x_k)\right] \overset{(69) \ \& \ (70)}{\le } \sqrt{1+\phi } \ {\bar{\rho }}^{\frac{k-1}{2}} \ d(x_0,P) \overset{(67)}{\le } \ \sqrt{1+\phi } \ {\bar{\rho }}^{\frac{k-1}{2}} \ \frac{2^{\sigma -1}}{\sqrt{n}}. \end{aligned}$$
(71)

Here, we used Theorems 3 and 4 along with the relations from Eqs. (67) and (68). Now, for detecting feasibility, we need to have \({{\,\mathrm{\mathbb {E}}\,}}[\theta (x_k)] < 2^{1-\sigma }\). That gives us

$$\begin{aligned} \sqrt{1+\phi } \ {\bar{\rho }}^{\frac{k-1}{2}} \ \frac{2^{\sigma -1}}{\sqrt{n}} < 2^{1-\sigma }. \end{aligned}$$

Simplifying the above inequality further, we get the following lower bound for k:

$$\begin{aligned} k-1 \ > \ \frac{4 \sigma - 4 -\log n + \log (1+\phi )}{\log \left( \frac{1}{{\bar{\rho }}}\right) }. \end{aligned}$$

Moreover, if the system \(Ax \le b\) is feasible, then the probability of not having a certificate of feasibility is bounded as follows:

$$\begin{aligned} p = {\mathbb {P}} \left( \theta (x_k) \ge 2^{1-\sigma }\right) \ \le \ \frac{{{\,\mathrm{\mathbb {E}}\,}}\left[ \theta (x_k)\right] }{2^{1-\sigma }} \ < \ \sqrt{\frac{1+\phi }{n}} \ 2^{2\sigma -2} \ {\bar{\rho }}^{\frac{k-1}{2}}. \end{aligned}$$

Here, we used Markov’s inequality \({\mathbb {P}} (x \ge t) \le \frac{{{\,\mathrm{\mathbb {E}}\,}}[x]}{t}\). This completes the proof of Theorem 6. \(\square \)
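
In code, the certificate of Theorem 6 is simply a stopping rule on the maximum violation \(\theta (x_k)\). The sketch below (ours, not the authors' code) runs an SKM-type step and halts with a certificate of feasibility as soon as \(\theta (x_k) < 2^{1-\sigma }\), where \(\sigma \) denotes the binary encoding length of the data as in the theorem; rows of A are assumed to have unit norm and the defaults are placeholders.

```python
import numpy as np

def certify_feasibility(A, b, sigma, beta=None, delta=1.0, iters=10000, seed=0):
    """Run an SKM-type iteration on A x <= b and stop with a certificate of
    feasibility once theta(x) = [max_i (a_i^T x - b_i)]^+ < 2^(1 - sigma),
    following the detection rule of Theorem 6.  Rows of A are assumed to
    have unit Euclidean norm; sigma is the binary encoding length of (A, b).
    Returns (certified, x, final_theta).
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    beta = beta or max(1, m // 2)
    threshold = 2.0 ** (1 - sigma)
    x = np.zeros(n)
    for _ in range(iters):
        theta = max(0.0, float(np.max(A @ x - b)))        # maximum violation theta(x)
        if theta < threshold:
            return True, x, theta                          # certificate of feasibility
        tau = rng.choice(m, size=beta, replace=False)
        res = np.maximum(A[tau] @ x - b[tau], 0.0)
        i_star = tau[np.argmax(res)]
        x = x - delta * max(A[i_star] @ x - b[i_star], 0.0) * A[i_star]
    return False, x, max(0.0, float(np.max(A @ x - b)))    # undecided within the iteration budget

# Example call with a hypothetical encoding length:
# certified, x, theta = certify_feasibility(A, b, sigma=20)
```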

Appendix 3

Proof of Theorem 7

From the update formula of the PASKM algorithm, we get

$$\begin{aligned}&{{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_{k}} [\Vert v_{k+1}- {\mathcal {P}}(v_{k+1}) \Vert ^2] \nonumber \\&\quad \overset{\text {Lemma} \ 3}{ \le }\ \ {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_{k}} [\Vert v_{k+1}- \omega {\mathcal {P}}(v_{k})-(1-\omega ) {\mathcal {P}}(y_{k}) \Vert ^2] \nonumber \\&\quad = {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_{k}} [\Vert \omega (v_{k}- {\mathcal {P}}(v_{k})) + (1-\omega ) (y_{k}- {\mathcal {P}}(y_{k})) - \gamma \left( a_{i^*}^Ty_{k}-b_{i^*}\right) ^+ a_{i^*}\Vert ^2] \nonumber \\&\quad = {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_{k}} [\Vert \omega (v_{k}- {\mathcal {P}}(v_{k})) + (1-\omega ) (y_{k}- {\mathcal {P}}(y_{k}))\Vert ^2] + \gamma ^2 {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_{k}} [|(a_{i^*}^Ty_{k}-b_{i^*})^+|^2] \nonumber \\&\quad - 2 \gamma (1-\omega ) \big \langle y_k-{\mathcal {P}}(y_{k}), {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_{k}} [(a_{i^*}^Ty_{k}-b_{i^*})^+ a_{i^*}] \big \rangle \nonumber \\&\quad - 2 \gamma \omega \big \langle v_k-{\mathcal {P}}(v_{k}), {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_{k}} [(a_{i^*}^Ty_{k}-b_{i^*})^+ a_{i^*}] \big \rangle \nonumber \\&\quad \le \omega \Vert v_{k}- {\mathcal {P}}(v_{k}) \Vert ^2 + (1-\omega ) \Vert y_{k}- {\mathcal {P}}(y_{k}) \Vert ^2 + \gamma ^2 {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_{k}} [|(a_{i^*}^Ty_{k}-b_{i^*})^+|^2] \nonumber \\&\quad - 2 \gamma (1-\omega ) {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_{k}} [|(a_{i^*}^Ty_{k}-b_{i^*})^+|^2] + \omega \gamma {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_{k}} [|(a_{i^*}^Ty_{k}-b_{i^*})^+|^2] + \omega \gamma \Vert v_{k}- {\mathcal {P}}(v_{k}) \Vert ^2 \nonumber \\&\quad = \omega (1+\gamma ) \Vert v_{k}- {\mathcal {P}}(v_{k}) \Vert ^2 + (1-\omega ) \Vert y_{k}- {\mathcal {P}}(y_{k}) \Vert ^2 + 2\gamma (\gamma + 3 \omega -2) f(y_k) \nonumber \\&\quad \le \omega (1+\gamma ) \Vert v_{k}- {\mathcal {P}}(v_{k}) \Vert ^2 + \left\{ 1-\omega +\gamma \mu _1 (\gamma + 3 \omega -2)\right\} \Vert y_{k}- {\mathcal {P}}(y_{k}) \Vert ^2. \end{aligned}$$
(72)

Here, we used the condition \(\gamma + 3 \omega -2 \le 0\). Similarly, using the update formula for \(y_{k+1}\), we have

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_{k}}[\Vert&y_{k+1}- {\mathcal {P}}(y_{k+1}) \Vert ^2] \nonumber \\&\overset{\text {Lemma} \ 3}{ \le }\ {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_{k}}[\Vert \alpha (v_{k+1}- {\mathcal {P}}(v_{k+1})) + (1-\alpha ) (x_{k+1} - {\mathcal {P}}(y_{k})) \Vert ^2] \nonumber \\&\le \alpha {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_{k}}[\Vert v_{k+1}- {\mathcal {P}}(v_{k+1})\Vert ^2] + (1-\alpha ) {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_{k}}[\Vert x_{k+1} - {\mathcal {P}}(y_{k})\Vert ^2] \nonumber \\&\overset{\text {Lemma} \ 9}{ \le }\ \alpha {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_{k}}[\Vert v_{k+1}- {\mathcal {P}}(v_{k+1})\Vert ^2] + (1-\alpha ) h(\delta ) \Vert y_{k} - {\mathcal {P}}(y_{k})\Vert ^2. \end{aligned}$$
(73)

Following Theorem 2, let us define the sequences \(H_k = {{\,\mathrm{\mathbb {E}}\,}}[\Vert v_k-{\mathcal {P}}(v_k)\Vert ^2]\) and \(F_k = {{\,\mathrm{\mathbb {E}}\,}}[\Vert y_k-{\mathcal {P}}(y_k)\Vert ^2]\). The goal is to prove that \(H_k\) and \(F_k\) satisfy condition (25). Now, taking expectation in (72) and using the tower property of expectation, we have

$$\begin{aligned} H_{k+1} \&\le \omega (1+\gamma ) H_k + \left\{ 1-\omega +\gamma \mu _1 (\gamma + 3 \omega -2)\right\} F_k. \end{aligned}$$
(74)

Similarly, taking expectation in (73) and using (74) along with the tower property of expectation, we have

$$\begin{aligned}&F_{k+1} \ \le \alpha H_{k+1} + (1-\alpha ) h(\delta ) F_k \nonumber \\&\quad \le \ \alpha \omega (1+\gamma ) H_k + \{ (1-\alpha ) h(\delta ) + \alpha (1-\omega ) + \alpha \gamma \mu _1 (\gamma + 3 \omega -2)\} F_k. \end{aligned}$$
(75)

Combining both (74) and (75), we can deduce the following matrix inequality:

$$\begin{aligned} \begin{bmatrix} H_{k+1} \\ F_{k+1} \end{bmatrix}&\le \begin{bmatrix} \varPi _1 &{} \varPi _2 \\ \varPi _3 &{} \ \varPi _4 \end{bmatrix} \begin{bmatrix} H_{k} \\ F_{k} \end{bmatrix} \le \begin{bmatrix} \varPi _1 &{} \varPi _2 \\ \varPi _3 &{} \ \varPi _4 \end{bmatrix}^{k+1} \begin{bmatrix} H_{0} \\ F_{0} \end{bmatrix}. \end{aligned}$$
(76)

Here, we used the fact that \(\varPi _1, \varPi _2, \varPi _3, \varPi _4 \ge 0\). Before applying Theorem 2 to simplify the expression in (76), we need to make sure that the sequences \(H_k\) and \(F_k\) satisfy the condition of (25). From the definitions, we have

$$\begin{aligned} \varPi _2&\varPi _3 - \varPi _1 \varPi _4 = \alpha \omega (1-\omega )(1+\gamma ) + \alpha \omega \gamma \mu _1 (1+\gamma )(\gamma +3\omega -2) \nonumber \\&- \omega h(\delta ) (1-\alpha ) (1+\gamma ) -\alpha \omega (1-\omega )(1+\gamma ) - \alpha \omega \gamma \mu _1 (1+\gamma )(\gamma +3\omega -2) \nonumber \\&\quad = - \omega h(\delta ) (1-\alpha ) (1+\gamma ) \le 0. \end{aligned}$$
(77)

Also, we have

$$\begin{aligned} \varPi _1 + \varPi _4- \varPi _1 \varPi _4+&\varPi _2 \varPi _3 = \omega (1+\gamma ) + h(\delta ) (1-\alpha ) + \alpha (1-\omega ) \nonumber \\&+ \alpha \gamma \mu _1 (\gamma +3\omega -2)- \omega h(\delta ) (1-\alpha ) (1+\gamma ) < 1. \end{aligned}$$
(78)

Here, in the last inequality, we used the given condition. From (78), we can check that \(\varPi _1 + \varPi _4 < 1+\omega h(\delta ) (1-\alpha ) (1+\gamma ) = 1+ \min \{1, \omega h(\delta ) (1-\alpha ) (1+\gamma )\} = 1+ \min \{1,\varPi _1 \varPi _4-\varPi _2 \varPi _3\}\). Also, from (77), we have \(\varPi _2 \varPi _3 - \varPi _1 \varPi _4 \le 0\), which is precisely the condition required in (25). Now, using Theorem 2, we have

$$\begin{aligned} \begin{bmatrix} H_{k+1} \\ F_{k+1} \end{bmatrix}&\le \begin{bmatrix} \varGamma _3 (\varGamma _1 \rho _2^{k+1} -\varGamma _2 \rho _1^{k+1}) \ \ &{} \ \ \varGamma _1 \varGamma _2 \varGamma _3 (\rho _1^{k+1}- \rho _2^{k+1}) \\ \varGamma _3 (\rho _2^{k+1}- \rho _1^{k+1}) \ &{} \ \varGamma _3 (\varGamma _1 \rho _1^{k+1} -\varGamma _2 \rho _2^{k+1}) \end{bmatrix} \ \begin{bmatrix} H_{0} \\ F_0 \end{bmatrix}. \end{aligned}$$
(79)

where \(\varGamma _1 , \varGamma _2, \varGamma _3, \rho _1, \rho _2\) can be derived from (26) using the given parameters. Note that the PASKM algorithm starts with \(x_0 = v_0 = y_0\); therefore, \( H_0 = \Vert v_0-{\mathcal {P}}(v_0)\Vert ^2 = \Vert y_0-{\mathcal {P}}(y_0)\Vert ^2 = F_0 \). Now, substituting the values of \(H_0\) and \(F_0\) in (79), we have

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}\begin{bmatrix} d(v_{k+1}, P)^2 \\ d(y_{k+1}, P)^2 \end{bmatrix} \le \begin{bmatrix} \varGamma _2 \varGamma _3 (\varGamma _1-1) \ \rho _1^{k+1}+ \varGamma _1 \varGamma _3 (1-\varGamma _2)\ \rho _2^{k+1} \\ \varGamma _3 (\varGamma _1-1) \ \rho _1^{k+1}+ \varGamma _3 (1-\varGamma _2)\ \rho _2^{k+1} \end{bmatrix} d(y_0,P)^2. \end{aligned}$$
(80)

Also, from Theorem 2, we have \(\varGamma _1, \varGamma _3 \ge 0\) and \( 0 \le \rho _1 \le \rho _2 < 1\), which proves the first part of the Theorem. Now, considering Lemma 6, we get

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}[ f(x_{k+1})] \ \le \ \frac{\mu _2}{2} \ {{\,\mathrm{\mathbb {E}}\,}}[\Vert y_{k+1}-{\mathcal {P}}(y_{k+1})\Vert ^2] = \ \frac{\mu _2}{2} \ {{\,\mathrm{\mathbb {E}}\,}}[d(y_{k+1},P)^2]. \end{aligned}$$
(81)

Now, substituting the result of (80) in (81), we get the second part of the Theorem. \(\square \)
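To illustrate the mechanics of the proof, the short Python sketch below forms the \(2\times 2\) matrix with entries \(\varPi _1,\ldots ,\varPi _4\) as they appear in (74)–(75), checks the two conditions (77)–(78) numerically, and iterates the recursion (76). The chosen values of \(\omega , \gamma , \alpha , \mu _1\) and \(h(\delta )\) are hypothetical and serve only to show that, under the stated conditions, both components of the bound decay geometrically.

```python
import numpy as np

# Hypothetical parameter values, for illustration only; Theorem 7 requires in
# particular gamma + 3*omega - 2 <= 0.
omega, gamma, alpha, mu1, h_delta = 0.2, 0.5, 0.5, 0.3, 0.9

c = gamma + 3 * omega - 2                      # must be <= 0
Pi1 = omega * (1 + gamma)
Pi2 = 1 - omega + gamma * mu1 * c
Pi3 = alpha * omega * (1 + gamma)
Pi4 = (1 - alpha) * h_delta + alpha * (1 - omega) + alpha * gamma * mu1 * c
Pi = np.array([[Pi1, Pi2], [Pi3, Pi4]])

# Conditions verified before invoking Theorem 2, cf. (77) and (78).
assert Pi2 * Pi3 - Pi1 * Pi4 <= 0
assert Pi1 + Pi4 - Pi1 * Pi4 + Pi2 * Pi3 < 1

# The eigenvalues of Pi govern the decay of the recursion; they play the role
# of rho_1, rho_2 in (79).
print("decay rates:", np.sort(np.linalg.eigvals(Pi).real))

# Iterate [H_{k+1}; F_{k+1}] <= Pi [H_k; F_k] starting from H_0 = F_0 = 1.
HF = np.array([1.0, 1.0])
for _ in range(50):
    HF = Pi @ HF
print("bound on (H_50, F_50):", HF)
```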

Proof of Theorem 8

Let us define \({\mathcal {V}} = \omega {\mathcal {P}}(v_{k}) +(1-\omega ) {\mathcal {P}}(y_{k}) \). Since \({\mathcal {V}} \in P\), using the update formula of \(v_{k+1}\) from Eq. (16), we have

$$\begin{aligned} d(v_{k+1},P)^2&= \Vert v_{k+1} - {\mathcal {P}}(v_{k+1})\Vert ^2 \overset{\text {Lemma} \ 3}{\le } \Vert v_{k+1} - {\mathcal {V}}\Vert ^2 \nonumber \\&\overset{(16)}{=} \big \Vert \omega v_k + (1-\omega ) y_k- {\mathcal {V}} -\gamma (a_{i^*}^Ty_k- b_{i^*})^+ a_{i^*} \big \Vert ^2 \nonumber \\&= \underbrace{ \Vert \omega v_k + (1-\omega ) y_k- {\mathcal {V}} \Vert ^2}_{I_1} + \gamma ^2 \underbrace{\Vert (a_{i^*}^Ty_k- b_{i^*})^+a_{i^*}\Vert ^2}_{I_2} \nonumber \\&\quad - 2 \gamma \underbrace{\big \langle \omega v_k + (1-\omega ) y_k- {\mathcal {V}} \ ,\ a_{i^*} (a_{i^*}^Ty_k-b_{i^{*}})^{+} \big \rangle }_{I_3} \nonumber \\&= I_1 + \gamma ^2 I_2 -2 \gamma I_3. \end{aligned}$$
(82)

Since \(\Vert \cdot \Vert ^{2}\) is a convex function and \(0< \omega < 1\), we can bound the expected first term as follows:

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_k} [I_1]&= {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_k} \left[ \Vert \omega v_k + (1-\omega ) y_k-{\mathcal {V}}\Vert ^2\right] \nonumber \\&= {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_k} \left[ \Vert \omega v_k + (1-\omega ) y_k- \omega {\mathcal {P}}(v_{k}) - (1-\omega ) {\mathcal {P}}(y_{k})\Vert ^2\right] \nonumber \\&\le \omega \Vert v_k-{\mathcal {P}}(v_{k})\Vert ^2 + (1-\omega ) \Vert y_k-{\mathcal {P}}(y_{k})\Vert ^2\nonumber \\&= \omega \ d(v_k,P)^2 + (1-\omega ) \ d(y_k,P)^2. \end{aligned}$$
(83)

Taking expectation with respect to the sampling distribution in the second term of Eq. (82) and using Lemma 9 with the choice \(z = x_{k+1}, \ x = y_k\) and \(\eta = 2\delta -\delta ^2\), we get

$$\begin{aligned} \gamma ^2 {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_k} \left[ \Vert (a_{i^*}^Ty_k- b_{i^*})^+ a_{i^*}\Vert ^2\right]&\ = \ \gamma ^2 {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_k} \left[ |(a_{i^*}^Ty_k- b_{i^*})^+ |^2\right] \nonumber \\ \overset{\text {Lemma} \ 9}{\le }&\ \frac{\gamma ^2 }{\eta } \left[ d(y_k,P)^2 - {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_k} \left[ d(x_{k+1},P)^2 \right] \right] . \end{aligned}$$
(84)

Now, taking expectation in the third term of (82) we get

$$ \begin{aligned} - 2 \gamma {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_k}&[I_3 ] = - 2 \gamma \big \langle \omega v_k + (1-\omega ) y_k-{\mathcal {V}}, {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_k} \left[ a_{i^*} (a_{i^*}^Ty_k-b_{i^*})^{+}\right] \big \rangle \nonumber \\&\overset{(14) \ \& \ (9) }{=} - 2 \gamma \big \langle \frac{\omega }{\alpha } \left[ y_k - (1-\alpha ) x_k\right] + (1-\omega ) y_k-{\mathcal {V}}, g(y_k) \big \rangle \nonumber \\&\qquad = - 2 \gamma \big \langle \frac{\omega (1-\alpha )}{\alpha }(y_k-x_k)+ y_k-{\mathcal {V}}, g(y_k) \big \rangle . \end{aligned}$$
(85)

Using Lemmas 7 and 8, we can simplify Eq. (85) as follows:

$$ \begin{aligned}&\ - 2 \gamma {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_k} [I_3 ] = - 2 \gamma \big \langle \frac{\omega (1-\alpha )}{\alpha }(y_k-x_k)+ y_k-{\mathcal {V}}, g(y_k) \big \rangle \nonumber \\&\quad = 2 \gamma \frac{\omega (1-\alpha )}{\alpha } \big \langle x_k-y_k, g(y_k) \big \rangle + 2 \gamma \big \langle {\mathcal {V}}-y_k, g(y_k) \big \rangle \nonumber \\&\overset{\text {Lemma} \ 7 \ \& \ 8}{\le } \frac{\gamma \omega (1-\alpha )}{\alpha } \ d(x_k,P)^2 - \frac{ \mu _1 \gamma \omega (1-\alpha )}{\alpha } \ d(y_k,P)^2 - 2 \mu _1 \gamma \ d(y_k,P)^2. \end{aligned}$$
(86)

Now, substituting the bounds from Eqs. (83), (84) and (86) into Eq. (82), we get the following:

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_k} [&d(v_{k+1} ,P)^2 ] \le {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_k}[I_1] + \gamma ^2 {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_k}[I_2 ]- 2 \gamma {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_k}[I_3]\\&\le \omega \ d(v_{k},P)^2 + (1-\omega ) \ d(y_k,P)^2 + \frac{\gamma ^2 }{\eta } \left\{ d(y_k,P)^2 - {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_k} \left[ d(x_{k+1},P)^2 \right] \right\} \\&\quad + \frac{\gamma \omega (1-\alpha )}{\alpha } d(x_k,P)^2 - \gamma \mu _1 \left( 2+\frac{ \omega (1-\alpha )}{\alpha }\right) d(y_k,P)^2. \end{aligned}$$

With further simplification, the above inequality can be written as follows:

$$\begin{aligned}&{{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_k} \left[ d(v_{k+1},P)^2 + \frac{\gamma ^2}{\eta } d(x_{k+1},P)^2 \right] \le \omega \ \left[ d(v_{k},P)^2 + \frac{\gamma (1-\alpha )}{\alpha } d(x_{k},P)^2 \right] \nonumber \\&\quad \quad \quad \quad \quad \quad \quad + d(y_k,P)^2 \ \left\{ 1-\omega + \frac{\gamma ^2 }{\eta } - 2\gamma \mu _1 - \frac{\gamma \omega \mu _1 (1-\alpha )}{\alpha } \right\} . \end{aligned}$$
(87)

Now, let’s choose the parameters as in Eq. (33) along with \(0< \zeta < \frac{4\eta \mu _1}{(1-\mu _1)^2} \). We can easily see that \(\frac{\gamma ^2 }{\eta } = \frac{ \gamma (1-\alpha )}{\alpha }\) and \( \alpha \in (0,1)\). Also note that

$$\begin{aligned} 2 \mu _1 \gamma \ = \ 2 \mu _1 \sqrt{\eta \zeta \mu _1} \ > \ 2 \mu _1 \sqrt{\frac{\zeta ^2 (1-\mu _1)^2 }{4}} = \mu _1 \zeta (1-\mu _1), \end{aligned}$$
(88)

which implies \(\omega < 1\). Similarly, whenever \(\mu _1 < 1\) we have

$$\begin{aligned} 2\gamma - \zeta - \frac{1}{\mu _1} < 2\sqrt{\eta \mu _1 \zeta } - \zeta - 1 \le 2 \sqrt{\zeta } - \zeta - 1 = -(\sqrt{\zeta }-1)^2 \le 0, \end{aligned}$$
(89)

which implies \(\omega > 0\). Also, using the parameter choice of (33), we have

$$\begin{aligned} 1-\omega + \frac{\gamma ^2 }{\eta } - 2\gamma \mu _1 -&\frac{\gamma \omega \mu _1 (1-\alpha )}{\alpha } = 1-\omega + \zeta \mu _1 - 2\gamma \mu _1 - \omega \zeta \mu _1^2 \nonumber \\&= 1- 2\gamma \mu _1 + \zeta \mu _1 - \omega (1+\zeta \mu _1^2) = 0. \end{aligned}$$
(90)

Now, using all of the above relations (Eqs. (33), (90)) in Eq. (87), we get the following:

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_k} \left[ d(v_{k+1},P)^2 + \frac{\gamma ^2}{\eta } d(x_{k+1},P)^2 \right]&\ \le \ \omega \Big [d(v_{k},P)^2 + \underbrace{\frac{\gamma (1-\alpha )}{\alpha }}_{= \frac{\gamma ^2 }{\eta }} \ d(x_{k},P)^2 \Big ] \nonumber \\&\quad + \Big [\underbrace{1-\omega + \frac{\gamma ^2 }{\eta } - 2\gamma \mu _1 - \frac{\gamma \omega \mu _1 (1-\alpha )}{\alpha } }_{= \ 0} \Big ] \ d(y_k,P)^2 \nonumber \\&= \omega \ \left[ d(v_{k},P)^2+ \frac{\gamma ^2}{\eta } d(x_{k},P)^2 \right] . \end{aligned}$$
(91)

Finally, taking expectation again with the tower rule and substituting \(\frac{\gamma ^2 }{\eta }= \zeta \mu _1 \), we have

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}\left[ d(v_{k+1},P)^2 + \zeta \mu _1 \ d(x_{k+1},P)^2 \right]&\le \ \omega ^{k+1} \ {{\,\mathrm{\mathbb {E}}\,}}\left[ d(v_{0},P)^2+ \zeta \mu _1 \ d(x_{0},P)^2 \right] \\&= (1+\zeta \mu _1) \ \omega ^{k+1} \ d(x_{0},P)^2. \end{aligned}$$

This proves the Theorem. Furthermore, for faster convergence, we need to choose parameters such that \(\omega \) becomes as small as possible. In the proof, we assumed \(\mu _1 <1\), which is the most common scenario. Whenever \(\mu _1 = 1\), we must have \(\mu _1 = \mu _2 =1\) and Lemma 6 holds with equality on both sides, i.e., \(f(x) = d(x,P)^2\). Therefore, if we choose \(\alpha , \gamma , \omega \) as \(\alpha = \frac{\eta }{\eta + \gamma }, \ \gamma = \sqrt{\zeta \eta }, \ \omega = 1- \frac{2\gamma }{1+\zeta }\), we can check that condition (89) holds and \(0< \omega < 1\) for any \(\zeta > 0\). \(\square \)
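The following Python sketch numerically checks the parameter relations used above. It assumes the expressions recovered from the proof, namely \(\gamma = \sqrt{\eta \zeta \mu _1}\), \(\alpha = \frac{\eta }{\eta +\gamma }\) (so that \(\frac{\gamma ^2}{\eta } = \frac{\gamma (1-\alpha )}{\alpha }\)) and \(\omega \) determined by (90), which we take to match the choice in Eq. (33); the numerical values of \(\delta \), \(\mu _1\) and \(\zeta \) are illustrative placeholders. The checks correspond to (88), (89) and (90).

```python
import math

# Illustrative placeholders (not values prescribed by the paper): delta is the
# projection step size and mu1 is the constant appearing in Lemma 6.
delta, mu1 = 1.0, 0.3
eta = 2 * delta - delta ** 2                         # eta = 2*delta - delta^2
zeta = 0.5 * 4 * eta * mu1 / (1 - mu1) ** 2          # any 0 < zeta < 4*eta*mu1/(1-mu1)^2

# Parameter expressions recovered from the proof (assumed to match Eq. (33)).
gamma = math.sqrt(eta * zeta * mu1)                  # so that gamma**2 / eta = zeta * mu1
alpha = eta / (eta + gamma)                          # so that gamma*(1-alpha)/alpha = gamma**2/eta
omega = (1 - 2 * gamma * mu1 + zeta * mu1) / (1 + zeta * mu1 ** 2)   # solves (90)

assert 2 * mu1 * gamma > mu1 * zeta * (1 - mu1)      # (88), hence omega < 1
assert 2 * gamma - zeta - 1 / mu1 < 0                # (89), hence omega > 0
residual = 1 - omega + zeta * mu1 - 2 * gamma * mu1 - omega * zeta * mu1 ** 2
assert abs(residual) < 1e-12                         # the bracket in (90) vanishes
assert 0 < omega < 1

print(f"alpha = {alpha:.4f}, gamma = {gamma:.4f}, contraction rate omega = {omega:.4f}")
```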

Proof of Theorem 9

For any natural number \(l \ge 1\), using the update formula of \(v_{l+1}\), we have

$$\begin{aligned} v_{l+1}&= \omega v_l +(1-\omega ) y_l - \gamma (a_{i^*}^Ty_{l}-b_{i^*})^+a_{i^*} \nonumber \\&\overset{(14)}{=} \ \left( 1-\omega + \frac{\omega }{\alpha }\right) y_l - \frac{\omega (1-\alpha )}{\alpha } x_l - \gamma (a_{i^*}^Ty_{l}-b_{i^*})^+a_{i^*}. \end{aligned}$$
(92)

Let \(\varphi = \omega (1-\alpha )\). It can be easily checked that \(0 \le \varphi < 1\). Now, considering Eq. (14), we have

$$ \begin{aligned}&y_{l+1} = \alpha v_{l+1} + (1-\alpha ) x_{l+1} \nonumber \\&\quad \overset{(15) \ \& \ (92) }{=} \ (1+\omega -\alpha \omega ) y_l - \omega (1-\alpha ) y_{l-1} + \omega \delta (1-\alpha ) (a_{j^*}^Ty_{l-1}-b_{j^*})^+a_{j^*} \nonumber \\&\quad \quad \quad \quad \quad \quad - [\alpha \gamma + (1-\alpha ) \delta ] \ (a_{i^*}^Ty_{l}-b_{i^*})^+a_{i^*} \nonumber \\&\quad = (1+\omega -\alpha \omega ) y_l - \omega (1-\alpha ) y_{l-1} + \omega \delta (1-\alpha ) (a_{j^*}^Ty_{l-1}-b_{j^*})^+a_{j^*} \nonumber \\&\quad \quad \quad \quad \quad \quad - \delta (1+\omega - \alpha \omega ) \ (a_{i^*}^Ty_{l}-b_{i^*})^+a_{i^*} \nonumber \\&\quad = (1+\varphi ) y_l - \varphi y_{l-1} + \delta \varphi (a_{j^*}^Ty_{l-1}-b_{j^*})^+a_{j^*} - \delta (1+\varphi ) \ (a_{i^*}^Ty_{l}-b_{i^*})^+a_{i^*}. \end{aligned}$$
(93)

Here, the indices \(i^{*}\) and \(j^*\) are defined by (6) for the iterates \(y_l\) and \(y_{l-1}\), respectively. Furthermore, with the choice \(x_0 = v_0\), the points \(y_0\) and \(y_1\) generated by the PASKM method (i.e., Algorithm 3 with an arbitrary parameter choice) can be calculated as

$$\begin{aligned} y_1 = \alpha v_1 + (1-\alpha ) x_1&= x_0 - (\alpha \gamma + \delta (1-\alpha )) (a_{i^*}^Ty_{0}-b_{i^*})^+a_{i^*} \nonumber \\&= y_0 - \delta (1+\varphi ) (a_{i^*}^Ty_{0}-b_{i^*})^+a_{i^*}, \end{aligned}$$
(94)

since \(y_0 = x_0 = v_0\). Now, let us define \({\bar{\vartheta }}_l = \frac{\varphi }{1-\varphi }[y_{l}-y_{l-1}+\delta (a_{j^*}^Ty_{l-1}-b_{j^*})^+a_{j^*}]\), \( {\bar{\varDelta }}_l = y_l + {\bar{\vartheta }}_l\) and \( {\bar{\chi }}_l = \Vert y_l+{\bar{\vartheta }}_l-{\mathcal {P}}({\bar{\varDelta }}_l)\Vert ^2\). Then, using the update formula (93), we have

$$\begin{aligned} y_{l+1} + {\bar{\vartheta }}_{l+1}&= y_{l+1} +\frac{\varphi }{1-\varphi }[y_{l+1}-y_{l}+\delta (a_{i^*}^Ty_{l}-b_{i^*})^+a_{i^*}] \\&= \frac{1}{1-\varphi } y_{l+1} - \frac{\varphi }{1-\varphi } y_{l}+ \frac{\delta \varphi }{1-\varphi } (a_{i^*}^Ty_{l}-b_{i^*})^+a_{i^*} \\&\overset{(93)}{=} y_l + {\bar{\vartheta }}_l - \frac{\delta }{1-\varphi } \left( a_{i^*}^Ty_l-b_{i^*}\right) ^+ a_{i^*}. \end{aligned}$$

Using the above relation, we can write

$$\begin{aligned} {\bar{\chi }}_{l+1}&= \Vert y_{l+1}+{\bar{\vartheta }}_{l+1}-{\mathcal {P}}({\bar{\varDelta }}_{l+1})\Vert ^2 \overset{\text {Lemma} \ 3}{ \le } \Vert y_{l+1}+{\bar{\vartheta }}_{l+1}-{\mathcal {P}}({\bar{\varDelta }}_l)\Vert ^2 \nonumber \\&= \big \Vert y_l + {\bar{\vartheta }}_l - \frac{\delta }{1-\varphi } \left( a_{i^*}^Ty_l-b_{i^*}\right) ^+ a_{i^*}-{\mathcal {P}}({\bar{\varDelta }}_l) \big \Vert ^2 \nonumber \\&= \underbrace{\Vert y_l+{\bar{\vartheta }}_l-{\mathcal {P}}({\bar{\varDelta }}_l)\Vert ^2}_{= {\bar{\chi }}_l} + \frac{\delta ^2}{(1-\varphi )^2} \underbrace{\Vert (a_{i^*}^Ty_l- b_{i^*})^+a_{i^*}\Vert ^2}_{I_1} \nonumber \\&\quad - \frac{2 \delta }{1-\varphi } \underbrace{\big \langle y_l+{\bar{\vartheta }}_l-{\mathcal {P}}({\bar{\varDelta }}_l) \ ,\ a_{i^*} (a_{i^*}^Ty_l-b_{i^{*}})^{+} \big \rangle }_{ I_2} \nonumber \\&= {\bar{\chi }}_l + \frac{\delta ^2}{(1-\varphi )^2} I_1- \frac{2 \delta }{1-\varphi } I_2. \end{aligned}$$
(95)

Taking expectation with respect to \({\mathbb {S}}_l\) in the second term of (95), we have

$$\begin{aligned} \frac{\delta ^2}{(1-\varphi )^2} {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_l} [I_1] \overset{(9) }{=} \frac{2\delta ^2}{(1-\varphi )^2} f(y_l). \end{aligned}$$
(96)

Similarly, we can simplify the third term of (95) as

$$ \begin{aligned}&- \frac{2\delta }{1-\varphi } {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_l} [I_2] \nonumber \\&\quad \overset{(9) }{=} -\frac{2\delta }{1-\varphi } \big \langle y_l-{\mathcal {P}}({\bar{\varDelta }}_l) , g(y_l) \big \rangle + \frac{2\delta \varphi }{(1-\varphi )^2} \big \langle y_{l-1}-y_l - \delta g(y_{l-1}) , g(y_l) \big \rangle \nonumber \\&\quad = -\frac{2\delta }{1-\varphi } \big \langle y_l-{\mathcal {P}}({\bar{\varDelta }}_l) , g(y_l) \big \rangle + \frac{2\delta \varphi }{(1-\varphi )^2} \big \langle y_{l-1}-y_l , g(y_l) \big \rangle \nonumber \\&\quad \quad \quad \quad \quad \quad -\frac{\delta ^2 \varphi }{(1-\varphi )^2} \left[ \Vert g(y_l)+g(y_{l-1})\Vert ^2-\Vert g(y_l)\Vert ^2-\Vert g(y_{l-1})\Vert ^2 \right] \nonumber \\&\quad \overset{\text {Lemma} \ 7 \ \& \ 8}{\le } - \frac{4\delta }{1-\varphi } f(y_l) + \frac{2\delta \varphi }{(1-\varphi )^2} \left[ f(y_{l-1}) - f(y_{l}) \right] + \frac{2\delta ^2 \varphi }{(1-\varphi )^2} \left[ f(y_{l-1}) + f(y_{l}) \right] \nonumber \\&\quad = \frac{2 \delta \varphi (1+\delta )}{(1-\varphi )^2} f(y_{l-1}) -\frac{2 \delta \varphi (1+\delta )}{(1-\varphi )^2} f(y_{l}) + \frac{4 \delta (\varphi + \delta \varphi -1)}{(1-\varphi )^2} f(y_{l}). \end{aligned}$$
(97)

Using the expressions of Eqs. (96) and (97) in (95) and simplifying further, we have

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}_{{\mathbb {S}}_l}[{\bar{\chi }}_{l+1}] + \frac{2\delta \varphi (1+\delta )}{(1-\varphi )^2} f(y_l) + \varsigma f(y_l) \ \le \ {\bar{\chi }}_l + \frac{2\delta \varphi (1+\delta )}{(1-\varphi )^2} f(y_{l-1}), \end{aligned}$$
(98)

where the constant \(\varsigma \) is defined as

$$\begin{aligned} \varsigma = \frac{4 \delta (1-\varphi - \delta \varphi )}{(1-\varphi )^2} - \frac{2 \delta ^2 }{(1-\varphi )^2} = \frac{2 \delta (2-2 \varphi - 2\delta \varphi -\delta )}{(1-\varphi )^2} \ > \ 0. \end{aligned}$$
(99)

Now, taking expectation again in (98) and using the tower property, we get

$$\begin{aligned} {\bar{q}}_{l+1} + \varsigma {{\,\mathrm{\mathbb {E}}\,}}[f(y_l)] \le {\bar{q}}_l, \quad l = 1,2,3\ldots , \end{aligned}$$
(100)

where \({\bar{q}}_l = {{\,\mathrm{\mathbb {E}}\,}}[{\bar{\chi }}_l] + \frac{2\delta \varphi (1+\delta )}{(1-\varphi )^2} {{\,\mathrm{\mathbb {E}}\,}}[f(y_{l-1})]\). Summing up (100) for \(l=1,2,\ldots ,k\), we get

$$\begin{aligned} \sum \limits _{l=1}^{k} {{\,\mathrm{\mathbb {E}}\,}}[f(y_l)] \ \le \ \frac{{\bar{q}}_1-{\bar{q}}_{k+1}}{\varsigma } \ \le \ \frac{{\bar{q}}_1}{\varsigma }. \end{aligned}$$
(101)

Now, using Jensen’s inequality, we have

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}\left[ f({\bar{y}}_k)\right] = {{\,\mathrm{\mathbb {E}}\,}}\left[ f\left( \frac{1}{k}\sum \limits _{l=1}^{k} y_l\right) \right] \ \le \ {{\,\mathrm{\mathbb {E}}\,}}\left[ \frac{1}{k} \sum \limits _{l=1}^{k}f(y_l)\right] \ = \ \frac{1}{k} \sum \limits _{l=1}^{k} {{\,\mathrm{\mathbb {E}}\,}}[f(y_l)] \ \overset{(101)}{\le } \frac{{\bar{q}}_1}{\varsigma k}. \end{aligned}$$

From (94), \(y_1 = y_0 - \delta (1+\varphi ) (a_{i^*}^Ty_0-b_{i^*})^+ a_{i^*}\) and \({\bar{\vartheta }}_1 = \frac{-\varphi ^2 \delta }{1-\varphi } (a_{i^*}^Ty_0-b_{i^*})^+ a_{i^*} \). Then,

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}[{\bar{\chi }}_1]&= {{\,\mathrm{\mathbb {E}}\,}}\left[ \Vert y_1+{\bar{\vartheta }}_1 -{\mathcal {P}}({\bar{\varDelta }}_1)\Vert ^2 \right] \overset{\text {Lemma} \ 3}{ \le } {{\,\mathrm{\mathbb {E}}\,}}\left[ \Vert y_1+{\bar{\vartheta }}_1 -{\mathcal {P}}(y_0)\Vert ^2 \right] \nonumber \\&= {{\,\mathrm{\mathbb {E}}\,}}\left[ \Vert y_0 -{\mathcal {P}}(y_0)-\frac{\delta }{1-\varphi } (a_{i^*}^Ty_0-b_{i^*})^+ a_{i^*} \Vert ^2 \right] \nonumber \\&= \Vert y_0 -{\mathcal {P}}(y_0)\Vert ^2 + \frac{\delta ^2}{(1-\varphi )^2} {{\,\mathrm{\mathbb {E}}\,}}[|(a_{i^*}^Ty_0-b_{i^*})^+|^2] \nonumber \\&\quad \quad \quad \quad \quad \quad - \frac{2\delta }{1-\varphi } \langle y_0 -{\mathcal {P}}(y_0) , {{\,\mathrm{\mathbb {E}}\,}}[(a_{i^*}^Ty_0-b_{i^*})^+ a_{i^*}] \rangle \nonumber \\&\overset{\text {Lemma} \ 8 }{\le } \Vert y_0 -{\mathcal {P}}(y_0)\Vert ^2 + \frac{2\delta ^2}{(1-\varphi )^2} f(y_0)-\frac{4\delta }{1-\varphi } f(y_0). \end{aligned}$$
(102)

Now, from our construction we get

$$\begin{aligned} {\bar{q}}_1 = {{\,\mathrm{\mathbb {E}}\,}}[{\bar{\chi }}_1] + \frac{2\delta \varphi (1+\delta )}{(1-\varphi )^2} {{\,\mathrm{\mathbb {E}}\,}}[f(y_{0})] \le d(y_0,P)^2+ \frac{2\delta (\delta -2+ 3\varphi + \delta \varphi )}{(1-\varphi )^2} f(y_0). \end{aligned}$$

Substituting the values of \(\varsigma \) and \({\bar{q}}_1\) in the expression of \({{\,\mathrm{\mathbb {E}}\,}}\left[ f({\bar{y}}_k)\right] \), we have the following:

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}\left[ f({\bar{y}}_k)\right] \le \frac{ (1-\omega + \alpha \omega )^2 \ d(y_0,P)^2+ 2\delta (\delta -2+ 3 \omega - 3 \alpha \omega + \delta \omega - \delta \alpha \omega ) f(y_0)}{2 \delta k \left( 2-2 \omega + 2 \alpha \omega - 2\delta \omega + 2 \delta \alpha \omega -\delta \right) }. \end{aligned}$$

This completes the proof of Theorem 9. \(\square \)
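As an illustration of the recursion analyzed above, the following Python sketch runs the momentum form (93) of PASKM on a small synthetic feasible system and reports the positive residual of the Cesàro average \({\bar{y}}_k\). The system, the parameter values and the reported residual (a proxy for the feasibility gap rather than the exact function \(f\) appearing in the theorem) are placeholders chosen only for illustration; the sketch is not intended to reproduce the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic feasible system A x <= b with unit-norm rows (placeholder data).
m, n, sample_size = 200, 20, 10
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)
x_feas = rng.standard_normal(n)
b = A @ x_feas + rng.uniform(0.1, 1.0, m)       # x_feas is strictly feasible

# Hypothetical parameters; the admissible ranges are those assumed in Theorem 9.
delta, omega, alpha = 1.0, 0.2, 0.5
phi = omega * (1 - alpha)                        # varphi = omega*(1 - alpha), so 0 <= phi < 1

def skm_index(y):
    # Sampling Kaczmarz-Motzkin rule: most violated constraint within a random sample.
    tau = rng.choice(m, size=sample_size, replace=False)
    i = tau[np.argmax(np.maximum(A[tau] @ y - b[tau], 0.0))]
    return i, max(A[i] @ y - b[i], 0.0)

# Momentum form (93): y_{l+1} = (1+phi) y_l - phi y_{l-1}
#   + delta*phi*(residual at y_{l-1})*a_{j*} - delta*(1+phi)*(residual at y_l)*a_{i*}.
y_prev = rng.standard_normal(n)                  # y_0 = x_0 = v_0
j, r_prev = skm_index(y_prev)
y = y_prev - delta * (1 + phi) * r_prev * A[j]   # first step, cf. (94)
y_sum = y.copy()

k = 500
for l in range(1, k):
    i, r = skm_index(y)
    y_next = (1 + phi) * y - phi * y_prev + delta * phi * r_prev * A[j] - delta * (1 + phi) * r * A[i]
    y_prev, y, j, r_prev = y, y_next, i, r
    y_sum += y

y_bar = y_sum / k                                # Cesaro average of the iterates y_1, ..., y_k
print("positive residual of the Cesaro average:", np.linalg.norm(np.maximum(A @ y_bar - b, 0.0)))
```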
