DOI: 10.1145/1273496.1273501
Article

Scalable training of L1-regularized log-linear models

Published: 20 June 2007

ABSTRACT

The L-BFGS limited-memory quasi-Newton method is the algorithm of choice for optimizing the parameters of large-scale log-linear models with L2 regularization, but it cannot be applied directly to an L1-regularized loss, which is non-differentiable whenever some parameter is zero. Efficient algorithms have been proposed for this task, but they become impractical when the number of parameters is very large. We present an algorithm, Orthant-Wise Limited-memory Quasi-Newton (OWL-QN), based on L-BFGS, that can efficiently optimize the L1-regularized log-likelihood of log-linear models with millions of parameters. In our experiments on a parse reranking task, our algorithm was several orders of magnitude faster than an alternative algorithm, and substantially faster than L-BFGS on the analogous L2-regularized problem. We also present a proof that OWL-QN is guaranteed to converge to a globally optimal parameter vector.
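
The paper itself gives the full algorithm and convergence proof. As a rough illustration of the orthant-wise idea summarized in the abstract, the NumPy sketch below implements the two ingredients OWL-QN adds to a smooth optimizer: a pseudo-gradient that handles the non-differentiability of the L1 term at zero, and a projection that forces each step to stay in one orthant (any coordinate that crosses zero is clipped to zero). For brevity it uses plain steepest descent with backtracking in place of the paper's L-BFGS search direction; the function names and the line-search constants are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pseudo_gradient(x, grad, C):
    """Pseudo-gradient of f(x) = loss(x) + C * ||x||_1.

    Where x_i != 0 the L1 term is differentiable. At x_i == 0 we take
    the one-sided partial derivative that points downhill, or 0 if the
    subdifferential already contains 0.
    """
    pg = np.where(x > 0, grad + C, np.where(x < 0, grad - C, 0.0))
    at_zero = (x == 0)
    right = grad + C  # right partial derivative at x_i == 0
    left = grad - C   # left partial derivative at x_i == 0
    pg = np.where(at_zero & (right < 0), right, pg)
    pg = np.where(at_zero & (left > 0), left, pg)
    return pg

def owlqn_like_minimize(loss, grad_fn, x0, C, iters=200, step0=1.0):
    """Minimize loss(x) + C * ||x||_1 with orthant-wise projected steps.

    Steepest descent on the pseudo-gradient stands in for the paper's
    L-BFGS direction; the orthant projection is the same idea.
    """
    x = x0.astype(float).copy()
    for _ in range(iters):
        g = pseudo_gradient(x, grad_fn(x), C)
        if np.linalg.norm(g) < 1e-8:
            break
        # Orthant for this iteration: sign(x_i), or -sign(pg_i) at zero.
        orthant = np.where(x != 0, np.sign(x), -np.sign(g))
        f0 = loss(x) + C * np.abs(x).sum()
        step = step0
        while step > 1e-12:
            x_new = x - step * g
            # Project: clip coordinates that left the chosen orthant.
            x_new = np.where(np.sign(x_new) == orthant, x_new, 0.0)
            if loss(x_new) + C * np.abs(x_new).sum() < f0:
                x = x_new
                break
            step *= 0.5
        else:
            break  # no descent step found; treat as converged
    return x
```

A toy lasso-style least-squares problem (again purely illustrative) shows the characteristic effect of the L1 penalty discussed in the abstract: most coordinates of the solution come out exactly zero.

```python
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
x_true = np.zeros(10)
x_true[:3] = [2.0, -1.0, 0.5]
b = A @ x_true
loss = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
x_hat = owlqn_like_minimize(loss, grad, np.zeros(10), C=1.0)
print(np.round(x_hat, 3))  # sparse: trailing coordinates exactly zero
```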


Published in

ICML '07: Proceedings of the 24th International Conference on Machine Learning
June 2007, 1233 pages
ISBN: 9781595937933
DOI: 10.1145/1273496

        Copyright © 2007 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


