
2019 | OriginalPaper | Chapter

Smooth Minimization of Nonsmooth Functions with Parallel Coordinate Descent Methods

Authors : Olivier Fercoq, Peter Richtárik

Published in: Modeling and Optimization: Theory and Applications

Publisher: Springer International Publishing


Abstract

We study the performance of a family of randomized parallel coordinate descent methods for minimizing a nonsmooth nonseparable convex function. The problem class includes as a special case L1-regularized L1 regression and the minimization of the exponential loss (“AdaBoost problem”). We assume that the input data defining the loss function is contained in a sparse \(m\times n\) matrix A with at most \(\omega \) nonzeros in each row and that the objective function has a “max structure”, allowing us to smooth it. Our main contribution consists in identifying parameters with a closed-form expression that guarantees a parallelization speedup that depends on basic quantities of the problem (like its size and the number of processors). The theory relies on a fine study of the Lipschitz constant of the smoothed objective restricted to low dimensional subspaces and shows an increased acceleration for sparser problems.
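To make the setting concrete, here is a minimal Python sketch of one such randomized parallel coordinate descent iteration applied to a smoothed L1 regression objective. It is not the chapter's exact method: the Huber-type smoothing, the step-size rule with \(\beta = \min\{\omega,\tau\}\), and the coordinate constants \(L_i = \|a_i\|^2/\mu\) are illustrative assumptions; the chapter derives its own closed-form parameters.

```python
import numpy as np

def huber_grad(r, mu):
    # Derivative of a Huber-type smoothing of |t| with smoothing parameter mu.
    return np.clip(r / mu, -1.0, 1.0)

def parallel_cd_smoothed_l1(A, b, mu=0.1, tau=4, n_iters=1000, seed=0):
    """Sketch: randomized parallel coordinate descent on
    f_mu(x) = sum_j psi_mu((Ax - b)_j), psi_mu a smoothed absolute value.
    Each iteration samples tau coordinates uniformly at random
    ("tau-nice" sampling); the updates are applied sequentially here for clarity."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    tau = min(tau, n)
    x = np.zeros(n)
    r = A @ x - b                                    # residual Ax - b, kept up to date
    omega = max(int(np.count_nonzero(A, axis=1).max()), 1)   # max nonzeros per row
    beta = min(omega, tau)                           # parallelism parameter (illustrative)
    L = np.maximum((A ** 2).sum(axis=0) / mu, 1e-12) # coordinate curvature of smoothed loss
    for _ in range(n_iters):
        S = rng.choice(n, size=tau, replace=False)   # tau-nice sampling
        g = huber_grad(r, mu)
        for i in S:
            step = -(A[:, i] @ g) / (beta * L[i])    # damped coordinate step
            x[i] += step
            r += step * A[:, i]
    return x
```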


Footnotes
1
The results presented in this paper were obtained in the Fall of 2012 and the Spring of 2013; the follow-up work [6] was prepared in the Summer of 2013.
 
2
We coined the term Nesterov separability in honor of Yu. Nesterov’s seminal work on the smoothing technique [19], which is applicable to functions represented in the form (8). Nesterov did not study problems with row-sparse matrices A, as we do in this work, nor did he study parallel coordinate descent methods. However, he proposed the celebrated smoothing technique which we also employ in this paper.
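For reference, the chapter's form (8) is not reproduced on this page; the standard max structure from [19] that this footnote alludes to, and its smooth approximation, are (a sketch in the usual notation):

```latex
f(x) = \max_{z \in Q} \big\{ \langle A x, z \rangle - \phi(z) \big\},
\qquad
f_\mu(x) = \max_{z \in Q} \big\{ \langle A x, z \rangle - \phi(z) - \mu\, d(z) \big\},
```

where Q is a compact convex set, \(\phi\) is convex and continuous, d is a prox-function that is 1-strongly convex on Q, and \(\mu > 0\) is the smoothing parameter. Then \(f_\mu\) is smooth with \(\nabla f_\mu(x) = A^\top z_\mu(x)\), \(z_\mu(x)\) being the maximizer, and its gradient is Lipschitz with constant proportional to \(\|A\|^2/\mu\) in suitably paired norms; refining this constant for row-sparse A and low-dimensional subspaces is the focus of the chapter.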
 
3
This is the case in many common situations, including (i) \(\Psi _i(t) = \lambda _i|t|\), (ii) \(\Psi _i(t) = \lambda _i t^2\), and (iii) \(\Psi _i(t) = 0\) for \(t \in [a_i,b_i]\) and \(+\infty \) outside this interval (as well as the multivariate/block generalizations of these functions). For more complicated functions \(\Psi _i(t)\), one may need to perform one-dimensional optimization, which costs O(1) for each i provided that we are content with an inexact solution. An analysis of PCDM in the \(\tau =1\) case in such an inexact setting can be found in Tappenden et al. [39], and it can be extended to the parallel setting.
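For illustration, the closed-form one-dimensional solutions for these three examples are sketched below, i.e. minimizers of \((t-u)^2/(2\,\mathrm{step}) + \Psi_i(t)\) for scalar u; the variable names (`u`, `step`, `lam`) are illustrative and not the chapter's notation.

```python
def prox_abs(u, step, lam):
    """Psi_i(t) = lam * |t|  ->  soft-thresholding."""
    s = abs(u) - step * lam
    return (1.0 if u >= 0 else -1.0) * max(s, 0.0)

def prox_square(u, step, lam):
    """Psi_i(t) = lam * t**2  ->  simple shrinkage."""
    return u / (1.0 + 2.0 * step * lam)

def prox_box(u, step, a, b):
    """Psi_i(t) = 0 on [a, b], +inf outside  ->  projection onto [a, b]."""
    return min(max(u, a), b)
```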
 
4
Note that \(h^{(S)}\) is different from \(h_{[S]}=\sum _{i \in S} U_i h^{(i)}\), which is a vector in \(\mathbb {R}^N\), although both \(h^{(S)}\) and \(h_{[S]}\) are composed of blocks \(h^{(i)}\) for \(i \in S\).
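A small illustration of this distinction, assuming scalar blocks (\(N = n\) and \(U_i\) the i-th standard basis vector); the array values are arbitrary.

```python
import numpy as np

h = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
S = [1, 3]

h_S = h[S]                      # h^{(S)}: only the blocks indexed by S (lower-dimensional)
h_bracket_S = np.zeros_like(h)  # h_{[S]} = sum_{i in S} U_i h^{(i)}: a vector in R^N ...
h_bracket_S[S] = h[S]           # ... equal to h on the blocks in S and zero elsewhere

print(h_S)          # [2. 4.]
print(h_bracket_S)  # [0. 2. 0. 4. 0.]
```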
 
5
In fact, the proof of the former is essentially identical to the proof of (44), and (46) follows from (44) by choosing \(J_1=J_2=J\) and \(\theta _{ij} = 1\).
 
6
This assumption is not restrictive, since \(\beta ' \ge 1\), \(n \ge \tau \), and \(\epsilon \) is usually small; it is, however, technically needed.
 
7
Without the assumption \(\beta '=\min \{\omega ,\tau \}\), the algorithm still converges, but with a proven complexity of \(O\big (1/(\epsilon \rho )\big )\) instead of \(O\big (1/\epsilon \log (1/(\epsilon \rho ))\big )\) [40]. In our experiments, we never encountered a problem when using the more efficient \(\tau \)-nice sampling, even in the non-strongly convex case. In fact, this weaker result may just be an artifact of the analysis.
 
8
Some synchronization does take place from time to time for monitoring purposes.
 
Literature
1. Bradley, J.K., Kyrola, A., Bickson, D., Guestrin, C.: Parallel coordinate descent for L1-regularized loss minimization. In: 28th International Conference on Machine Learning (2011)
2.
3. Collins, M., Schapire, R.E., Singer, Y.: Logistic regression, AdaBoost and Bregman distances. Mach. Learn. 48(1–3), 253–285 (2002)
4. Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex optimization with inexact oracle. Math. Prog. 1–39 (2013)
5. Dang, C.D., Lan, G.: Stochastic block mirror descent methods for nonsmooth and stochastic optimization. SIAM J. Opt. 25(2), 856–881 (2015)
6. Fercoq, O.: Parallel coordinate descent for the AdaBoost problem. In: International Conference on Machine Learning and Applications—ICMLA ’13 (2013)
7. Fercoq, O., Richtárik, P.: Accelerated, parallel, and proximal coordinate descent. SIAM J. Opt. 25(4), 1997–2023 (2015)
8. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. In: Computational Learning Theory, pp. 23–37. Springer (1995)
9. Guyon, I., Gunn, S., Ben-Hur, A., Dror, G.: Result analysis of the NIPS 2003 feature selection challenge. Adv. Neural Inf. Process. Syst. 17, 545–552 (2004)
10. Journée, M., Nesterov, Y., Richtárik, P., Sepulchre, R.: Generalized power method for sparse principal component analysis. J. Mach. Learn. Res. 11, 517–553 (2010)
11. Lacoste-Julien, S., Jaggi, M., Schmidt, M., Pletcher, P.: Block-coordinate Frank-Wolfe optimization for structural SVMs. In: 30th International Conference on Machine Learning (2013)
12. Leventhal, D., Lewis, A.S.: Randomized methods for linear constraints: convergence rates and conditioning. Math. Op. Res. 35(3), 641–654 (2010)
13. Liu, J., Wright, S.J., Ré, C., Bittorf, V., Sridhar, S.: An asynchronous parallel stochastic coordinate descent algorithm. J. Mach. Learn. Res. 16(285–322), 1–5 (2015)
14. Lu, Z., Xiao, L.: On the complexity analysis of randomized block-coordinate descent methods. Math. Prog. 152(1–2), 615–642 (2015)
15. Mukherjee, I., Canini, K., Frongillo, R., Singer, Y.: Parallel boosting with momentum. In: Machine Learning and Knowledge Discovery in Databases, ECML. Lecture Notes in Computer Science, vol. 8188 (2013)
16. Mukherjee, I., Rudin, C., Schapire, R.E.: The rate of convergence of AdaBoost. J. Mach. Learn. Res. 14(1), 2315–2347 (2013)
17. Ma, J., Saul, L.K., Savage, S., Voelker, G.M.: Identifying suspicious URLs: an application of large-scale online learning. In: Proceedings of the 26th International Conference on Machine Learning, pp. 681–688. ACM (2009)
18. Necoara, I., Clipici, D.: Efficient parallel coordinate descent algorithm for convex optimization problems with separable constraints: application to distributed MPC. J. Process Control 23, 243–253 (2013)
19. Nesterov, Y.: Smooth minimization of nonsmooth functions. Math. Prog. 103, 127–152 (2005)
20. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Opt. 22(2), 341–362 (2012)
21. Nesterov, Y.: Subgradient methods for huge-scale optimization problems. Math. Prog. 146(1), 275–297 (2014)
22. Necoara, I., Nesterov, Y., Glineur, F.: Efficiency of randomized coordinate descent methods on optimization problems with linearly coupled constraints. Technical report, Politehnica University of Bucharest (2012)
23. Necoara, I., Patrascu, A.: A random coordinate descent algorithm for optimization problems with composite objective function and linear coupled constraints. Comput. Opt. Appl. 57(2), 307–337 (2014)
24. Palit, I., Reddy, C.K.: Scalable and parallel boosting with MapReduce. IEEE Trans. Knowl. Data Eng. 24(10), 1904–1916 (2012)
25. Richtárik, P., Takáč, M.: Efficiency of randomized coordinate descent methods on minimization problems with a composite objective function. In: 4th Workshop on Signal Processing with Adaptive Sparse Structured Representations (2011)
26. Richtárik, P., Takáč, M.: Efficient serial and parallel coordinate descent methods for huge-scale truss topology design. In: Operations Research Proceedings, pp. 27–32. Springer (2012)
27. Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Prog. 144(2), 1–38 (2014)
28. Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. Math. Prog. 156(1), 433–484 (2016)
29. Richtárik, P., Takáč, M., Damla Ahipaşaoğlu, S.: Alternating maximization: unifying framework for 8 sparse PCA formulations and efficient parallel codes. arXiv:1212.4137 (2012)
30. Ruszczyński, A.: On convergence of an augmented Lagrangian decomposition method for sparse convex optimization. Math. Op. Res. 20(3), 634–656 (1995)
31. Schapire, R.E., Freund, Y.: Boosting: Foundations and Algorithms. The MIT Press (2012)
32. Shalev-Shwartz, S., Tewari, A.: Stochastic methods for \(\ell _1\)-regularized loss minimization. J. Mach. Learn. Res. 12, 1865–1892 (2011)
33. Shalev-Shwartz, S., Zhang, T.: Accelerated mini-batch stochastic dual coordinate ascent. In: Advances in Neural Information Processing Systems, pp. 378–385 (2013)
34. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14, 567–599 (2013)
35. Takáč, M., Bijral, A., Richtárik, P., Srebro, N.: Mini-batch primal and dual methods for SVMs. In: 30th International Conference on Machine Learning (2013)
36. Telgarsky, M.: A primal-dual convergence analysis of boosting. J. Mach. Learn. Res. 13, 561–606 (2012)
37. Tao, Q., Kong, K., Chu, D., Wu, G.: Stochastic coordinate descent methods for regularized smooth and nonsmooth losses. In: Machine Learning and Knowledge Discovery in Databases, pp. 537–552 (2012)
38. Tappenden, R., Richtárik, P., Büke, B.: Separable approximations and decomposition methods for the augmented Lagrangian. Opt. Methods Softw. 30(3), 643–668 (2015)
39. Tappenden, R., Richtárik, P., Gondzio, J.: Inexact coordinate descent: complexity and preconditioning. J. Opt. Theory Appl. 1–33 (2016)
40. Tappenden, R., Takáč, M., Richtárik, P.: On the complexity of parallel coordinate descent. Opt. Methods Softw. 1–24 (2017)
Metadata
Title
Smooth Minimization of Nonsmooth Functions with Parallel Coordinate Descent Methods
Authors
Olivier Fercoq
Peter Richtárik
Copyright Year
2019
DOI
https://doi.org/10.1007/978-3-030-12119-8_4