Abstract
Discriminative training approaches like structural SVMs have shown much promise for building highly complex and accurate models in areas like natural language processing, protein structure prediction, and information retrieval. However, current training algorithms are computationally expensive or intractable on large datasets. To overcome this bottleneck, this paper explores how cutting-plane methods can provide fast training not only for classification SVMs, but also for structural SVMs. We show that for an equivalent “1-slack” reformulation of the linear SVM training problem, our cutting-plane method has time complexity linear in the number of training examples. In particular, the number of iterations does not depend on the number of training examples; it scales linearly with the regularization parameter C and with the inverse of the desired precision ε. Furthermore, we present an extensive empirical evaluation of the method applied to binary classification, multi-class classification, HMM sequence tagging, and CFG parsing. The experiments show that the cutting-plane algorithm is broadly applicable and fast in practice. On large datasets, it is typically several orders of magnitude faster than conventional training methods derived from decomposition methods like SVM-light, or conventional cutting-plane methods. Implementations of our methods are available at www.joachims.org.
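The 1-slack idea described above can be sketched for the simplest case, a binary linear SVM. The following is an illustrative simplification, not the authors' implementation: each cutting plane aggregates the margin violations of all n examples into a single constraint w·a ≥ b − ξ, and the working-set QP is solved here by naive dual coordinate ascent (the function name and solver choice are this sketch's own):

```python
import numpy as np

def one_slack_svm(X, y, C=1.0, eps=1e-3, max_iters=100):
    """Cutting-plane training of a linear binary SVM, 1-slack style.

    Constraints have the form  w . a_k >= b_k - xi,  where for an
    indicator vector c in {0,1}^n over the examples:
        a_k = (1/n) * sum_i c_i * y_i * x_i,   b_k = (1/n) * sum_i c_i.
    """
    n, d = X.shape
    w = np.zeros(d)
    A, b = [], []  # working set of cutting planes
    for _ in range(max_iters):
        # Most violated constraint: include example i iff its margin < 1.
        c = (y * (X @ w) < 1).astype(float)
        a_new = (c[:, None] * y[:, None] * X).mean(axis=0)
        b_new = c.mean()
        # Current slack implied by the working set.
        xi = max(0.0, max((bk - ak @ w for ak, bk in zip(A, b)), default=0.0))
        if b_new - a_new @ w <= xi + eps:
            break  # no constraint is violated by more than eps
        A.append(a_new)
        b.append(b_new)
        # Working-set QP in the dual, solved by coordinate ascent:
        #   max_alpha  sum_k alpha_k b_k - 0.5 ||sum_k alpha_k a_k||^2
        #   s.t.       alpha_k >= 0,  sum_k alpha_k <= C.
        K = np.array(A)
        alpha = np.zeros(len(A))
        for _ in range(200):
            for k in range(len(A)):
                g = b[k] - K[k] @ (K.T @ alpha)   # partial derivative
                denom = K[k] @ K[k]
                if denom > 0:
                    new = alpha[k] + g / denom
                    upper = C - (alpha.sum() - alpha[k])
                    alpha[k] = min(max(new, 0.0), upper)
        w = K.T @ alpha
    return w

# Toy usage on a linearly separable problem.
X = np.array([[2., 1.], [1., 2.], [-1., -2.], [-2., -1.]])
y = np.array([1., 1., -1., -1.])
w = one_slack_svm(X, y, C=10.0)
```

The point of the sketch is the structural property the paper analyzes: the working set grows by one aggregated constraint per iteration regardless of n, so the number of iterations (and hence the number of QPs solved) is independent of the dataset size.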
Editor: Tony Jebara.
Cite this article
Joachims, T., Finley, T. & Yu, CN.J. Cutting-plane training of structural SVMs. Mach Learn 77, 27–59 (2009). https://doi.org/10.1007/s10994-009-5108-8