DOI: 10.1145/1273496.1273501
Article

Scalable training of L1-regularized log-linear models

Published: 20 June 2007

ABSTRACT

The L-BFGS limited-memory quasi-Newton method is the algorithm of choice for optimizing the parameters of large-scale log-linear models with L2 regularization, but it cannot be applied directly to an L1-regularized loss, which is non-differentiable whenever some parameter is zero. Efficient algorithms have been proposed for this task, but they become impractical when the number of parameters is very large. We present an algorithm, Orthant-Wise Limited-memory Quasi-Newton (OWL-QN), based on L-BFGS, that can efficiently optimize the L1-regularized log-likelihood of log-linear models with millions of parameters. In our experiments on a parse reranking task, our algorithm was several orders of magnitude faster than an alternative algorithm, and substantially faster than L-BFGS on the analogous L2-regularized problem. We also present a proof that OWL-QN is guaranteed to converge to a globally optimal parameter vector.
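
The paper itself gives the full algorithm and convergence proof. As a rough illustration of the orthant-wise idea summarized in the abstract, the NumPy sketch below implements the two ingredients OWL-QN adds to a smooth optimizer: a pseudo-gradient that handles the non-differentiability of the L1 term at zero, and a projection that forces each step to stay in one orthant (any coordinate that crosses zero is clipped to zero). For brevity it uses plain steepest descent with backtracking in place of the paper's L-BFGS search direction; the function names and the line-search constants are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pseudo_gradient(x, grad, C):
    """Pseudo-gradient of f(x) = loss(x) + C * ||x||_1.

    Where x_i != 0 the L1 term is differentiable. At x_i == 0 we take
    the one-sided partial derivative that points downhill, or 0 if the
    subdifferential already contains 0.
    """
    pg = np.where(x > 0, grad + C, np.where(x < 0, grad - C, 0.0))
    at_zero = (x == 0)
    right = grad + C  # right partial derivative at x_i == 0
    left = grad - C   # left partial derivative at x_i == 0
    pg = np.where(at_zero & (right < 0), right, pg)
    pg = np.where(at_zero & (left > 0), left, pg)
    return pg

def owlqn_like_minimize(loss, grad_fn, x0, C, iters=200, step0=1.0):
    """Minimize loss(x) + C * ||x||_1 with orthant-wise projected steps.

    Steepest descent on the pseudo-gradient stands in for the paper's
    L-BFGS direction; the orthant projection is the same idea.
    """
    x = x0.astype(float).copy()
    for _ in range(iters):
        g = pseudo_gradient(x, grad_fn(x), C)
        if np.linalg.norm(g) < 1e-8:
            break
        # Orthant for this iteration: sign(x_i), or -sign(pg_i) at zero.
        orthant = np.where(x != 0, np.sign(x), -np.sign(g))
        f0 = loss(x) + C * np.abs(x).sum()
        step = step0
        while step > 1e-12:
            x_new = x - step * g
            # Project: clip coordinates that left the chosen orthant.
            x_new = np.where(np.sign(x_new) == orthant, x_new, 0.0)
            if loss(x_new) + C * np.abs(x_new).sum() < f0:
                x = x_new
                break
            step *= 0.5
        else:
            break  # no descent step found; treat as converged
    return x
```

A toy lasso-style least-squares problem (again purely illustrative) shows the characteristic effect of the L1 penalty discussed in the abstract: most coordinates of the solution come out exactly zero.

```python
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
x_true = np.zeros(10)
x_true[:3] = [2.0, -1.0, 0.5]
b = A @ x_true
loss = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
x_hat = owlqn_like_minimize(loss, grad, np.zeros(10), C=1.0)
print(np.round(x_hat, 3))  # sparse: trailing coordinates exactly zero
```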


Published in

ICML '07: Proceedings of the 24th International Conference on Machine Learning
June 2007, 1233 pages
ISBN: 9781595937933
DOI: 10.1145/1273496

        Copyright © 2007 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


