Abstract
Data sparseness or overfitting is a serious problem in natural language processing employing machine learning methods. This is true even for the maximum entropy (ME) method, whose flexible modeling capability has alleviated data sparseness more successfully than other probabilistic models in many NLP tasks. In ME estimation, the model is usually estimated so that it exactly satisfies the equality constraints on feature expectations. However, exact satisfaction leads to undesirable overfitting, especially for sparse features, because the constraints derived from a limited amount of training data are always uncertain. To control overfitting in ME estimation, we propose the use of box-type inequality constraints, under which equality can be violated up to certain predefined levels that reflect this uncertainty. The resulting models, inequality ME models, in effect perform regularized estimation with L1 norm penalties on bounded parameters. Most importantly, this regularized estimation enables the model parameters to become sparse, which can be regarded as automatic feature selection and is expected to improve generalization performance further. We evaluate the inequality ME models on text categorization datasets and demonstrate their advantages over standard ME estimation, similarly motivated Gaussian MAP estimation of ME models, and support vector machines (SVMs), one of the state-of-the-art methods for text categorization.
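As a sketch of the relaxation described in the abstract (the notation here is assumed for illustration; the paper itself defines how the constraint widths are set), standard ME estimation requires, for each feature $f_i$, that the model expectation match the empirical expectation exactly:

\[
E_{\tilde p}[f_i] - E_{p}[f_i] = 0 .
\]

The box-type inequality constraints relax each equality into a two-sided bound with widths $A_i, B_i > 0$ that encode the uncertainty of that constraint:

\[
-B_i \;\le\; E_{\tilde p}[f_i] - E_{p}[f_i] \;\le\; A_i .
\]

In the dual formulation, the upper and lower constraints contribute non-negative parameters $\alpha_i$ and $\beta_i$, and the log-likelihood is penalized by $\sum_i \left( A_i \alpha_i + B_i \beta_i \right)$, an L1-style penalty on the bounded parameters. A feature whose $\alpha_i$ and $\beta_i$ are both zero at the solution is effectively removed from the model, which is the automatic feature selection mentioned above.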
References
Benson, S., McInnes, L. C., Moré, J. J., & Sarich, J. (2002). TAO users manual. Technical Report ANL/MCS-TM-242-Revision 1.4, Argonne National Laboratory.
Benson, S. J., & Moré, J. J. (2001). A limited memory variable metric method for bound constraint minimization. Technical Report ANL/MCS-P909-0901, Argonne National Laboratory.
Berger, A. L., Della Pietra, S. A., & Della Pietra, V. J. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22:1, 39–71.
Bertsekas, D. P. (1999). Nonlinear programming, 2nd edition. Athena Scientific.
Borthwick, A. (1999). A maximum entropy approach to named entity recognition. Ph.D. Thesis. New York University.
Chen, S. F., & Rosenfeld, R. (1999). A Gaussian prior for smoothing maximum entropy models. Technical Report CMU-CS-99-108, Carnegie Mellon University.
Chen, S. F., & Rosenfeld, R. (2000). A survey of smoothing techniques for ME models. IEEE Transactions on Speech and Audio Processing, 8:1, 37–50.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press.
Curran, J. R., & Clark, S. (2003). Investigating GIS and smoothing for maximum entropy taggers. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL) (pp. 91–98).
Darroch, J. N., & Ratcliff, D. (1972). Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43, 1470–1480.
Della Pietra, S., Della Pietra, V. J., & Lafferty, J. (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19:4, 380–393.
Fang, S.-C., Rajasekera, J. R., & Tsao, H.-S. J. (1997). Entropy optimization and mathematical programming. Kluwer Academic Publishers.
Goodman, J. (2003). Exponential priors for maximum entropy models. Technical Report MSR-TR-coming soon. From http://research.microsoft.com/~joshuago/ (as of September 25, 2003).
Goodman, J. (2004). Exponential priors for maximum entropy models. In Proceedings of the HLT-NAACL 2004 (pp. 305–311).
Hersh, W., Buckley, C., Leone, T., & Hickam, D. (1994). OHSUMED: An interactive retrieval evaluation and new large test collection for research. In Proceedings of the 17th Annual ACM SIGIR Conference (pp. 192–201).
Joachims, T. (1998a). Making large-scale support vector machine learning practical. In Advances in Kernel Methods (pp. 169–184). The MIT Press.
Joachims, T. (1998b). Text categorization with support vector machines. In Proceedings of the 10th European Conference on Machine Learning (ECML) (pp. 137–142).
Johnson, M., Geman, S., Canon, S., Chi, Z., & Riezler, S. (1999). Estimators for stochastic “unification-based” grammars. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL).
Johnson, M., & Riezler, S. (2000). Exploiting auxiliary distributions in stochastic unification-based grammars. In Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL) (pp. 154–161).
Kazama, J. (2004). Improving maximum entropy natural language processing by uncertainty-aware extensions and unsupervised learning. Ph.D. thesis, University of Tokyo.
Kazama, J., & Tsujii, J. (2003). Evaluation and extensions of maximum entropy models with inequality constraints. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 137–144).
Khudanpur, S. (1995). A method of ME estimation with relaxed constraints. In Proceedings of Johns Hopkins University Language Modeling Workshop (pp. 1–17).
Lau, R. (1994). Adaptive statistical language modeling. Master’s Thesis. MIT.
Lewis, D. D. (1992). Feature selection and feature extraction for text categorization. In Proceedings of Speech and Natural Language Workshop (pp. 212–217).
Malouf, R. (2002). A comparison of algorithms for maximum entropy parameter estimation. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL) (pp. 49–55).
McCallum, A. (2003). Efficiently inducing features of conditional random fields. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI).
Minka, T. P. (2001). Algorithms for maximum-likelihood logistic regression. CMU Statistics Tech Report 758.
Miyao, Y., & Tsujii, J. (2002). Maximum entropy estimation for feature forests. In Proceedings of Human Language Technology Conference (HLT) 2002.
Mladenic, D., & Grobelnik, M. (1998). Word sequences as features in text learning. In Proceedings of the 17th Electrotechnical and Computer Science Conference (ERK98).
Newman, W. (1977). Extension to the ME method. IEEE Transactions on Information Theory, IT-23, 89–93.
Nigam, K., Lafferty, J., & McCallum, A. (1999). Using maximum entropy for text classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering (pp. 61–67).
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 79–86).
Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 133–142).
Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24:5, 513–523.
Scott, S., & Matwin, S. (1999). Feature engineering for text classification. In Proceedings of the 16th International Conference on Machine Learning (ICML) (pp. 379–388).
Sha, F., & Pereira, F. (2003). Shallow parsing with conditional random fields. In Proceedings of the HLT-NAACL 2003 (pp. 134–141).
Shirai, K., Inui, K., Tokunaga, T., & Tanaka, H. (1998). Saidai entoropii moderu ni yoru kakuritu moderu no parameta suitei ni yuukou na sosei no sentaku ni tuite (On the selection of useful features for estimating probabilistic models based on the maximum entropy method). In Proceedings of 4th Annual Meeting of Natural Language Processing (pp. 356–359). (in Japanese).
Tan, C., Wang, Y., & Lee, C. (2002). The use of bigrams to enhance text categorization. Information Processing & Management, 38:4, 529–546.
Vapnik, V. (1995). The nature of statistical learning theory. Springer Verlag.
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning (ICML) (pp. 412–420).
Zhou, Y., Weng, F., Wu, L., & Schmidt, H. (2003). A fast algorithm for feature selection in conditional maximum entropy modeling. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 153–159).
Additional information
Editors: Dan Roth and Pascale Fung
This article is an elaborated version of Kazama and Tsujii (2003), with several new experimental results added.
Cite this article
Kazama, J., Tsujii, J. Maximum Entropy Models with Inequality Constraints: A Case Study on Text Categorization. Mach Learn 60, 159–194 (2005). https://doi.org/10.1007/s10994-005-0911-3