From the projected stochastic subgradient method
\( w_{t+1} = \Pi _D(w_t - A_tg_t)\), we have:
$$\begin{aligned}{} & {} ||w_{t+1} - w^*||^2 \le ||w_t - w^*-A_tg_t||^2\\{} & {} \quad = ||w_t - w^*||^2 - 2A_t \langle g_t, w_t-w^*\rangle + A_t^2 ||g_t||^2 \end{aligned}$$
From Lemma
3 we have:
$$\begin{aligned} \le ||w_t - w^*||^2 - 2A_t (f_t(w_t) - f_t(w^*)) + A_t^2 ||g_t||^2 \end{aligned}$$
Rearranging this we get:
$$\begin{aligned}{} & {} 2A_t(f_t(w_t) - f_t(w^*)) \le ||w_t - w^*||^2 - ||w_{t+1} - w^*|| + A_t^2 ||g_t||^2\\{} & {} \quad f_t(w_t) - f_t(w^*) \le \frac{1}{2A_t} ||w_t - w^*||^2 - \\{} & {} \quad - \frac{1}{2A_t} ||w_{t+1} - w^*|| + \frac{A_t}{2} ||g_t||^2 \end{aligned}$$
Summing from t=1,2,..,T to form the Regret, we have:
$$\begin{aligned}{} & {} R(T) = \sum _{t=1}^T \left( f_t(w_t) - f_t(w^*) \right) \le \\{} & {} \quad \le \sum _{t=1}^T \frac{1}{2A_t} ||w_t - w^*||^2 - \sum _{t=1}^T\frac{1}{2A_t} ||w_{t+1} - w^*|| + \\{} & {} \quad + \sum _{t=1}^T\frac{A_t}{2} ||g_t||^2 \end{aligned}$$
Unfolding the sums we get:
$$\begin{aligned}{} & {} \le \left( \frac{1}{2A_1}||w_1 - w^*||^2 + \frac{1}{2A_2}||w_2 - w^*||^2\right. + ... \\{} & {} \quad + \left. \frac{1}{2A_T}||w_T - w^*||^2 \right) - \\{} & {} \quad - \left( \frac{1}{2A_1}||w_2 - w^*||^2 + \frac{1}{2A_2}||w_3 - w^*||^2 + ... + \frac{1}{2A_T}||w_{T+1} - w^*||^2 \right) \\{} & {} \quad + \sum _{t=1}^T\frac{A_t}{2} ||g_t||^2 \\{} & {} \quad = \frac{1}{2A_1}||w_1- w^*||^2 + \sum _{t=1}^{T-1} \left( \frac{1}{2A_{t+1}} - \frac{1}{2A_t} \right) ||w_{t+1} - w^*||^2 - \\{} & {} \quad - \frac{1}{2A_T}||w_{T+1} - w^*||^2 + \frac{1}{2}\sum _{t=1}^T A_t ||g_t||^2 \\{} & {} \quad \le \frac{1}{2A_1}||w_1- w^*||^2 + \sum _{t=1}^{T-1} \left( \frac{1}{2A_{t+1}} - \frac{1}{2A_t} \right) ||w_{t+1} - w^*||^2 + \\{} & {} \quad + \frac{1}{2}\sum _{t=1}^T A_t ||g_t||^2 \end{aligned}$$
Because the gradients are bounded
\(\Vert g_t\Vert \le G\) and using lemma
4 we have:
$$\begin{aligned}{} & {} \le \frac{4r^2}{2A_1} + \frac{1}{2} \sum _{t=1}^T A_t G^2 + \sum _{t=1}^{T-1} \left( \frac{1}{2A_{t+1}} - \frac{1}{2A_t} \right) ||w_{t+1} - w^*||^2 \\{} & {} \quad = \frac{2r^2}{A_1} + \frac{G^2}{2} \sum _{t=1}^T A_t + \frac{1}{2}\sum _{t=1}^{T-1} \left( \frac{1}{A_{t+1}} - \frac{1}{A_t} \right) ||w_{t+1} - w^*||^2 \end{aligned}$$
From lemma
5 we prove that with certain
\(c_t\) the difference is positive so we can use lemma
4 to get:
$$\begin{aligned} \le \frac{2r^2}{A_1} + \frac{G^2}{2} \sum _{t=1}^T A_t + \frac{4r^2}{2}\sum _{t=1}^{T-1} \left( \frac{1}{A_{t+1}} - \frac{1}{A_t} \right) \end{aligned}$$
Now the sum becomes telescopic:
$$\begin{aligned}{} & {} = \frac{2r^2}{A_1} + \frac{G^2}{2} \sum _{t=1}^T A_t + 2r^2 \left( \frac{1}{A_{T}} - \frac{1}{A_1} \right) \\{} & {} \quad = \frac{2r^2}{A_T} + \frac{G^2}{2} \sum _{t=1}^T A_t\\{} & {} \quad = \frac{2r^2 \sqrt{T}}{\alpha _0 S_T} + \frac{G^2 \alpha _0}{2} \sum _{t=1}^T \frac{S_t}{\sqrt{t}}\end{aligned}$$
Clearly the behavior of
\(A_t\) affects the convergence of the whole algorithm. The left term shows that if
\(A_t\) decreases really fast (i.e.
\(\mathcal {O}(1/t^2)\)) the algorithm diverges. On the other hand, if
\(A_t\) is constant or increases then the right term (the sum) will diverge. Lemma
5 shows that
\(c_t\) controls how fast
\(A_t\) decreases. There can be many
\(c_t\) that lead to convergence with various speeds. Here one of them is presented that satisfies the following equation:
Using the bounds of
\(S_t\) from Lemma
2 we have:
$$\begin{aligned} R(T) \le \frac{2r^2 \sqrt{T}}{\alpha _0 \frac{M (1 - \gamma )}{2G + \max _T(c_T)}} + \frac{G^2 \alpha _0}{2} \sum _{t=1}^T \frac{1}{\sqrt{t}} \left( 1 + \frac{G }{\min _T(c_T)} \right) \end{aligned}$$
From the Eq.
A.3 of
\(c_t\) it is clear that
\(c_t\) is non-decreasing, thus, the above inequality transforms into:
$$\begin{aligned}{} & {} R(T) \le \frac{2r^2 (2G + c_T) \sqrt{T}}{\alpha _0 M (1 - \gamma )} + \frac{G^2 \alpha _0}{2} \left( 1 + \frac{G }{c_1} \right) \sum _{t=1}^T \frac{1}{\sqrt{t}} \\{} & {} \quad R(T) \le \frac{2r^2 (2G + c_T) \sqrt{T}}{\alpha _0 M (1 - \gamma )} + \frac{G^2 \alpha _0}{2} \left( 1 + \frac{G }{c_1} \right) \int _0^T \frac{1}{\sqrt{t}} dt \\{} & {} \quad R(T) \le \frac{2r^2 (2G + c_T) \sqrt{T}}{\alpha _0 M (1 - \gamma )} + G^2 \alpha _0 \left( 1 + \frac{G }{c_1} \right) \sqrt{T} \end{aligned}$$
Dividing by
T:
$$\begin{aligned} \frac{R(T)}{T} \le \frac{2r^2 (2G + c_T)}{\alpha _0 M (1 - \gamma ) \sqrt{T}} + \frac{G^2 \alpha _0}{\sqrt{T}} \left( 1 + \frac{G }{c_1} \right) \end{aligned}$$
which shows that:
$$\begin{aligned} \frac{R(T)}{T} = \mathcal {O} \left( \frac{1}{\sqrt{T}} \right) \,\, \text {and }\,\, \lim _{T\rightarrow \infty } \frac{R(T)}{T} = 0 \end{aligned}$$
\(\square \)