Abstract
Data sparseness or overfitting is a serious problem in natural language processing employing machine learning methods. This is true even for the maximum entropy (ME) method, whose flexible modeling capability has alleviated data sparseness more successfully than other probabilistic models in many NLP tasks. In ME estimation, the model is usually estimated so that it exactly satisfies the equality constraints on feature expectations. However, exact satisfaction leads to undesirable overfitting, especially for sparse features, because the constraints derived from a limited amount of training data are always uncertain. To control overfitting in ME estimation, we propose the use of box-type inequality constraints, under which equality can be violated up to certain predefined levels that reflect this uncertainty. The resulting models, inequality ME models, in effect perform regularized estimation with L1 norm penalties on bounded parameters. Most importantly, this regularized estimation enables the model parameters to become sparse, which can be regarded as automatic feature selection and is expected to improve generalization performance further. We evaluate the inequality ME models on text categorization datasets and demonstrate their advantages over standard ME estimation, similarly motivated Gaussian MAP estimation of ME models, and support vector machines (SVMs), one of the state-of-the-art methods for text categorization.
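As a sketch of the relaxation described in the abstract (the notation here is assumed for illustration; the paper itself defines how the constraint widths are set), standard ME estimation requires, for each feature $f_i$, that the model expectation match the empirical expectation exactly:

\[
E_{\tilde p}[f_i] - E_{p}[f_i] = 0 .
\]

The box-type inequality constraints relax each equality into a two-sided bound with widths $A_i, B_i > 0$ that encode the uncertainty of that constraint:

\[
-B_i \;\le\; E_{\tilde p}[f_i] - E_{p}[f_i] \;\le\; A_i .
\]

In the dual formulation, the upper and lower constraints contribute non-negative parameters $\alpha_i$ and $\beta_i$, and the log-likelihood is penalized by $\sum_i \left( A_i \alpha_i + B_i \beta_i \right)$, an L1-style penalty on the bounded parameters. A feature whose $\alpha_i$ and $\beta_i$ are both zero at the solution is effectively removed from the model, which is the automatic feature selection mentioned above.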
References
Benson, S., McInnes, L. C., Moré, J. J., & Sarich, J. (2002). TAO users manual. Technical Report ANL/MCS-TM-242-Revision 1.4, Argonne National Laboratory.
Benson, S. J., & Moré, J. J. (2001). A limited memory variable metric method for bound constraint minimization. Technical Report ANL/MCS-P909-0901, Argonne National Laboratory.
Berger, A. L., Della Pietra, S. A., & Della Pietra, V. J. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22:1, 39–71.
Bertsekas, D. P. (1999). Nonlinear programming, 2nd edition. Athena Scientific.
Borthwick, A. (1999). A maximum entropy approach to named entity recognition. Ph.D. Thesis. New York University.
Chen, S. F., & Rosenfeld, R. (1999). A Gaussian prior for smoothing maximum entropy models. Technical Report CMU-CS-99-108, Carnegie Mellon University.
Chen, S. F., & Rosenfeld, R. (2000). A survey of smoothing techniques for ME models. IEEE Transactions on Speech and Audio Processing, 8:1, 37–50.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press.
Curran, J. R., & Clark, S. (2003). Investigating GIS and smoothing for maximum entropy taggers. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL) (pp. 91–98).
Darroch, J. N., & Ratcliff, D. (1972). Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43, 1470–1480.
Della Pietra, S., Della Pietra, V. J., & Lafferty, J. (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19:4, 380–393.
Fang, S.-C., Rajasekera, J. R., & Tsao, H.-S. J. (1997). Entropy optimization and mathematical programming. Kluwer Academic Publishers.
Goodman, J. (2003). Exponential priors for maximum entropy models. Technical Report MSR-TR-coming soon. From http://research.microsoft.com/~joshuago/ (as of September 25, 2003).
Goodman, J. (2004). Exponential priors for maximum entropy models. In Proceedings of the HLT-NAACL 2004 (pp. 305–311).
Hersh, W., Buckley, C., Leone, T., & Hickam, D. (1994). OHSUMED: An interactive retrieval evaluation and new large test collection for research. In Proceedings of the 17th Annual ACM SIGIR Conference (pp. 192–201).
Joachims, T. (1998a). Making large-scale support vector machine learning practical. In Advances in Kernel Methods (pp. 169–184). The MIT Press.
Joachims, T. (1998b). Text categorization with support vector machines. In Proceedings of the 10th European Conference on Machine Learning (ECML) (pp. 137–142).
Johnson, M., Geman, S., Canon, S., Chi, Z., & Riezler, S. (1999). Estimators for stochastic “unification-based” grammars. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL).
Johnson, M., & Riezler, S. (2000). Exploiting auxiliary distributions in stochastic unification-based grammars. In Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL) (pp. 154–161).
Kazama, J. (2004). Improving maximum entropy natural language processing by uncertainty-aware extensions and unsupervised learning. Ph.D. thesis, University of Tokyo.
Kazama, J., & Tsujii, J. (2003). Evaluation and extensions of maximum entropy models with inequality constraints. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 137–144).
Khudanpur, S. (1995). A method of ME estimation with relaxed constraints. In Proceedings of Johns Hopkins University Language Modeling Workshop (pp. 1–17).
Lau, R. (1994). Adaptive statistical language modeling. Master’s Thesis. MIT.
Lewis, D. D. (1992). Feature selection and feature extraction for text categorization. In Proceedings of Speech and Natural Language Workshop (pp. 212–217).
Malouf, R. (2002). A comparison of algorithms for maximum entropy parameter estimation. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL) (pp. 49–55).
McCallum, A. (2003). Efficiently inducing features of conditional random fields. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI).
Minka, T. P. (2001). Algorithms for maximum-likelihood logistic regression. CMU Statistics Tech Report 758.
Miyao, Y., & Tsujii, J. (2002). Maximum entropy estimation for feature forests. In Proceedings of Human Language Technology Conference (HLT) 2002.
Mladenic, D., & Grobelnik, M. (1998). Word sequences as features in text learning. In Proceedings of the 17th Electrotechnical and Computer Science Conference (ERK98).
Newman, W. (1977). Extension to the ME method. IEEE Transactions on Information Theory, IT-23, 89–93.
Nigam, K., Lafferty, J., & McCallum, A. (1999). Using maximum entropy for text classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering (pp. 61–67).
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 79–86).
Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 133–142).
Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24:5, 513–523.
Scott, S., & Matwin, S. (1999). Feature engineering for text classification. In Proceedings of the 16th International Conference on Machine Learning (ICML) (pp. 379–388).
Sha, F., & Pereira, F. (2003). Shallow parsing with conditional random fields. In Proceedings of the HLT-NAACL 2003 (pp. 134–141).
Shirai, K., Inui, K., Tokunaga, T., & Tanaka, H. (1998). Saidai entoropii moderu ni yoru kakuritu moderu no parameta suitei ni yuukou na sosei no sentaku ni tuite (On the selection of useful features for estimating probabilistic models based on the maximum entropy method). In Proceedings of 4th Annual Meeting of Natural Language Processing (pp. 356–359). (in Japanese).
Tan, C., Wang, Y., & Lee, C. (2002). The use of bigrams to enhance text categorization. Information Processing & Management, 38:4, 529–546.
Vapnik, V. (1995). The nature of statistical learning theory. Springer Verlag.
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning (ICML) (pp. 412–420).
Zhou, Y., Weng, F., Wu, L., & Schmidt, H. (2003). A fast algorithm for feature selection in conditional maximum entropy modeling. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 153–159).
Additional information
Editors: Dan Roth and Pascale Fung
This article is an elaborated version of Kazama and Tsujii (2003), with several new experimental results added.
Cite this article
Kazama, J., Tsujii, J. Maximum Entropy Models with Inequality Constraints: A Case Study on Text Categorization. Mach Learn 60, 159–194 (2005). https://doi.org/10.1007/s10994-005-0911-3