
2019 | OriginalPaper | Chapter

Smooth Minimization of Nonsmooth Functions with Parallel Coordinate Descent Methods

Authors : Olivier Fercoq, Peter Richtárik

Published in: Modeling and Optimization: Theory and Applications

Publisher: Springer International Publishing


Abstract

We study the performance of a family of randomized parallel coordinate descent methods for minimizing a nonsmooth nonseparable convex function. The problem class includes as a special case L1-regularized L1 regression and the minimization of the exponential loss (“AdaBoost problem”). We assume that the input data defining the loss function is contained in a sparse \(m\times n\) matrix A with at most \(\omega \) nonzeros in each row and that the objective function has a “max structure”, allowing us to smooth it. Our main contribution consists in identifying parameters with a closed-form expression that guarantees a parallelization speedup that depends on basic quantities of the problem (like its size and the number of processors). The theory relies on a fine study of the Lipschitz constant of the smoothed objective restricted to low dimensional subspaces and shows an increased acceleration for sparser problems.
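To make the setting concrete, here is a minimal Python sketch of one such randomized parallel coordinate descent iteration applied to a smoothed L1 regression objective. It is not the chapter's exact method: the Huber-type smoothing, the step-size rule with \(\beta = \min\{\omega,\tau\}\), and the coordinate constants \(L_i = \|a_i\|^2/\mu\) are illustrative assumptions; the chapter derives its own closed-form parameters.

```python
import numpy as np

def huber_grad(r, mu):
    # Derivative of a Huber-type smoothing of |t| with smoothing parameter mu.
    return np.clip(r / mu, -1.0, 1.0)

def parallel_cd_smoothed_l1(A, b, mu=0.1, tau=4, n_iters=1000, seed=0):
    """Sketch: randomized parallel coordinate descent on
    f_mu(x) = sum_j psi_mu((Ax - b)_j), psi_mu a smoothed absolute value.
    Each iteration samples tau coordinates uniformly at random
    ("tau-nice" sampling); the updates are applied sequentially here for clarity."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    tau = min(tau, n)
    x = np.zeros(n)
    r = A @ x - b                                    # residual Ax - b, kept up to date
    omega = max(int(np.count_nonzero(A, axis=1).max()), 1)   # max nonzeros per row
    beta = min(omega, tau)                           # parallelism parameter (illustrative)
    L = np.maximum((A ** 2).sum(axis=0) / mu, 1e-12) # coordinate curvature of smoothed loss
    for _ in range(n_iters):
        S = rng.choice(n, size=tau, replace=False)   # tau-nice sampling
        g = huber_grad(r, mu)
        for i in S:
            step = -(A[:, i] @ g) / (beta * L[i])    # damped coordinate step
            x[i] += step
            r += step * A[:, i]
    return x
```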


Footnotes
1
The results presented in this paper were obtained in the Fall of 2012 and the Spring of 2013; the follow-up work [6] was prepared in the Summer of 2013.
 
2
We coined the term Nesterov separability in honor of Yu. Nesterov’s seminal work on the smoothing technique [19], which is applicable to functions represented in the form (8). Nesterov did not study problems with row-sparse matrices A, as we do in this work, nor did he study parallel coordinate descent methods. However, he proposed the celebrated smoothing technique which we also employ in this paper.
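For reference, the chapter's form (8) is not reproduced on this page; the standard max structure from [19] that this footnote alludes to, and its smooth approximation, are (a sketch in the usual notation):

```latex
f(x) = \max_{z \in Q} \big\{ \langle A x, z \rangle - \phi(z) \big\},
\qquad
f_\mu(x) = \max_{z \in Q} \big\{ \langle A x, z \rangle - \phi(z) - \mu\, d(z) \big\},
```

where Q is a compact convex set, \(\phi\) is convex and continuous, d is a prox-function that is 1-strongly convex on Q, and \(\mu > 0\) is the smoothing parameter. Then \(f_\mu\) is smooth with \(\nabla f_\mu(x) = A^\top z_\mu(x)\), \(z_\mu(x)\) being the maximizer, and its gradient is Lipschitz with constant proportional to \(\|A\|^2/\mu\) in suitably paired norms; refining this constant for row-sparse A and low-dimensional subspaces is the focus of the chapter.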
 
3
This is the case in many common situations, including (i) \(\Psi _i(t) = \lambda _i|t|\), (ii) \(\Psi _i(t) = \lambda _i t^2\), and (iii) \(\Psi _i(t) = 0\) for \(t \in [a_i,b_i]\) and \(+\infty \) outside this interval (as well as the multivariate/block generalizations of these functions). For more complicated functions \(\Psi _i(t)\), one may need to perform one-dimensional optimization, which costs O(1) for each i provided that we are content with an inexact solution. An analysis of PCDM in the \(\tau =1\) case in such an inexact setting can be found in Tappenden et al. [39], and it can be extended to the parallel setting.
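For illustration, the closed-form one-dimensional solutions for these three examples are sketched below, i.e. minimizers of \((t-u)^2/(2\,\mathrm{step}) + \Psi_i(t)\) for scalar u; the variable names (`u`, `step`, `lam`) are illustrative and not the chapter's notation.

```python
def prox_abs(u, step, lam):
    """Psi_i(t) = lam * |t|  ->  soft-thresholding."""
    s = abs(u) - step * lam
    return (1.0 if u >= 0 else -1.0) * max(s, 0.0)

def prox_square(u, step, lam):
    """Psi_i(t) = lam * t**2  ->  simple shrinkage."""
    return u / (1.0 + 2.0 * step * lam)

def prox_box(u, step, a, b):
    """Psi_i(t) = 0 on [a, b], +inf outside  ->  projection onto [a, b]."""
    return min(max(u, a), b)
```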
 
4
Note that \(h^{(S)}\) is different from \(h_{[S]}=\sum _{i \in S} U_i h^{(i)}\), which is a vector in \(\mathbb {R}^N\), although both \(h^{(S)}\) and \(h_{[S]}\) are composed of blocks \(h^{(i)}\) for \(i \in S\).
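A small illustration of this distinction, assuming scalar blocks (\(N = n\) and \(U_i\) the i-th standard basis vector); the array values are arbitrary.

```python
import numpy as np

h = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
S = [1, 3]

h_S = h[S]                      # h^{(S)}: only the blocks indexed by S (lower-dimensional)
h_bracket_S = np.zeros_like(h)  # h_{[S]} = sum_{i in S} U_i h^{(i)}: a vector in R^N ...
h_bracket_S[S] = h[S]           # ... equal to h on the blocks in S and zero elsewhere

print(h_S)          # [2. 4.]
print(h_bracket_S)  # [0. 2. 0. 4. 0.]
```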
 
5
In fact, the proof of the former is essentially identical to the proof of (44), and (46) follows from (44) by choosing \(J_1=J_2=J\) and \(\theta _{ij} = 1\).
 
6
This assumption is not restrictive, since \(\beta ' \ge 1\), \(n \ge \tau \), and \(\epsilon \) is usually small; it is, however, technically needed.
 
7
Without the assumption \(\beta '=\min \{\omega ,\tau \}\), the algorithm still converges, but with a proven complexity of \(O\big (1/(\epsilon \rho )\big )\) instead of \(O\big (1/\epsilon \log (1/(\epsilon \rho ))\big )\) [40]. In our experiments, we never encountered a problem when using the more efficient \(\tau \)-nice sampling, even in the non-strongly convex case. In fact, this weaker result may just be an artifact of the analysis.
 
8
Some synchronization does take place from time to time for monitoring purposes.
 
Literature
1. Bradley, J.K., Kyrola, A., Bickson, D., Guestrin, C.: Parallel coordinate descent for L1-regularized loss minimization. In: 28th International Conference on Machine Learning (2011)
2.
3. Collins, M., Schapire, R.E., Singer, Y.: Logistic regression, AdaBoost and Bregman distances. Mach. Learn. 48(1–3), 253–285 (2002)
4. Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex optimization with inexact oracle. Math. Prog. 1–39 (2013)
5. Dang, C.D., Lan, G.: Stochastic block mirror descent methods for nonsmooth and stochastic optimization. SIAM J. Opt. 25(2), 856–881 (2015)
6. Fercoq, O.: Parallel coordinate descent for the AdaBoost problem. In: International Conference on Machine Learning and Applications—ICMLA ’13 (2013)
7. Fercoq, O., Richtárik, P.: Accelerated, parallel, and proximal coordinate descent. SIAM J. Opt. 25(4), 1997–2023 (2015)
8. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. In: Computational Learning Theory, pp. 23–37. Springer (1995)
9. Guyon, I., Gunn, S., Ben-Hur, A., Dror, G.: Result analysis of the NIPS 2003 feature selection challenge. Adv. Neural Inf. Process. Syst. 17, 545–552 (2004)
10. Journée, M., Nesterov, Y., Richtárik, P., Sepulchre, R.: Generalized power method for sparse principal component analysis. J. Mach. Learn. Res. 11, 517–553 (2010)
11. Lacoste-Julien, S., Jaggi, M., Schmidt, M., Pletcher, P.: Block-coordinate Frank-Wolfe optimization for structural SVMs. In: 30th International Conference on Machine Learning (2013)
12. Leventhal, D., Lewis, A.S.: Randomized methods for linear constraints: convergence rates and conditioning. Math. Op. Res. 35(3), 641–654 (2010)
13. Liu, J., Wright, S.J., Ré, C., Bittorf, V., Sridhar, S.: An asynchronous parallel stochastic coordinate descent algorithm. J. Mach. Learn. Res. 16(285–322), 1–5 (2015)
14. Lu, Z., Xiao, L.: On the complexity analysis of randomized block-coordinate descent methods. Math. Prog. 152(1–2), 615–642 (2015)
15. Mukherjee, I., Canini, K., Frongillo, R., Singer, Y.: Parallel boosting with momentum. In: Machine Learning and Knowledge Discovery in Databases, ECML. Lecture Notes in Computer Science, vol. 8188 (2013)
16. Mukherjee, I., Rudin, C., Schapire, R.E.: The rate of convergence of AdaBoost. J. Mach. Learn. Res. 14(1), 2315–2347 (2013)
17. Ma, J., Saul, L.K., Savage, S., Voelker, G.M.: Identifying suspicious URLs: an application of large-scale online learning. In: Proceedings of the 26th International Conference on Machine Learning, pp. 681–688. ACM (2009)
18. Necoara, I., Clipici, D.: Efficient parallel coordinate descent algorithm for convex optimization problems with separable constraints: application to distributed MPC. J. Process Control 23, 243–253 (2013)
19. Nesterov, Y.: Smooth minimization of nonsmooth functions. Math. Prog. 103, 127–152 (2005)
20. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Opt. 22(2), 341–362 (2012)
21. Nesterov, Y.: Subgradient methods for huge-scale optimization problems. Math. Prog. 146(1), 275–297 (2014)
22. Necoara, I., Nesterov, Y., Glineur, F.: Efficiency of randomized coordinate descent methods on optimization problems with linearly coupled constraints. Technical report, Politehnica University of Bucharest (2012)
23. Necoara, I., Patrascu, A.: A random coordinate descent algorithm for optimization problems with composite objective function and linear coupled constraints. Comput. Opt. Appl. 57(2), 307–337 (2014)
24. Palit, I., Reddy, C.K.: Scalable and parallel boosting with MapReduce. IEEE Trans. Knowl. Data Eng. 24(10), 1904–1916 (2012)
25. Richtárik, P., Takáč, M.: Efficiency of randomized coordinate descent methods on minimization problems with a composite objective function. In: 4th Workshop on Signal Processing with Adaptive Sparse Structured Representations (2011)
26. Richtárik, P., Takáč, M.: Efficient serial and parallel coordinate descent methods for huge-scale truss topology design. In: Operations Research Proceedings, pp. 27–32. Springer (2012)
27. Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Prog. 144(2), 1–38 (2014)
28. Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. Math. Prog. 156(1), 433–484 (2016)
29. Richtárik, P., Takáč, M., Damla Ahipaşaoğlu, S.: Alternating maximization: unifying framework for 8 sparse PCA formulations and efficient parallel codes. arXiv:1212.4137 (2012)
30. Ruszczyński, A.: On convergence of an augmented Lagrangian decomposition method for sparse convex optimization. Math. Op. Res. 20(3), 634–656 (1995)
31. Schapire, R.E., Freund, Y.: Boosting: Foundations and Algorithms. The MIT Press (2012)
32. Shalev-Shwartz, S., Tewari, A.: Stochastic methods for \(\ell _1\)-regularized loss minimization. J. Mach. Learn. Res. 12, 1865–1892 (2011)
33. Shalev-Shwartz, S., Zhang, T.: Accelerated mini-batch stochastic dual coordinate ascent. In: Advances in Neural Information Processing Systems, pp. 378–385 (2013)
34. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14, 567–599 (2013)
35. Takáč, M., Bijral, A., Richtárik, P., Srebro, N.: Mini-batch primal and dual methods for SVMs. In: 30th International Conference on Machine Learning (2013)
36. Telgarsky, M.: A primal-dual convergence analysis of boosting. J. Mach. Learn. Res. 13, 561–606 (2012)
37. Tao, Q., Kong, K., Chu, D., Wu, G.: Stochastic coordinate descent methods for regularized smooth and nonsmooth losses. In: Machine Learning and Knowledge Discovery in Databases, pp. 537–552 (2012)
38. Tappenden, R., Richtárik, P., Büke, B.: Separable approximations and decomposition methods for the augmented Lagrangian. Opt. Methods Softw. 30(3), 643–668 (2015)
39. Tappenden, R., Richtárik, P., Gondzio, J.: Inexact coordinate descent: complexity and preconditioning. J. Opt. Theory Appl. 1–33 (2016)
40. Tappenden, R., Takáč, M., Richtárik, P.: On the complexity of parallel coordinate descent. Opt. Methods Softw. 1–24 (2017)
Metadata
Title
Smooth Minimization of Nonsmooth Functions with Parallel Coordinate Descent Methods
Authors
Olivier Fercoq
Peter Richtárik
Copyright Year
2019
DOI
https://doi.org/10.1007/978-3-030-12119-8_4