An Empirical Study of Smoothing Techniques for Language Modeling

ABSTRACT
We present an extensive empirical comparison of several smoothing techniques in the domain of language modeling, including those described by Jelinek and Mercer (1980), Katz (1987), and Church and Gale (1991). We investigate for the first time how factors such as training data size, corpus (e.g., Brown versus Wall Street Journal), and n-gram order (bigram versus trigram) affect the relative performance of these methods, which we measure through the cross-entropy of test data. In addition, we introduce two novel smoothing techniques, one a variation of Jelinek-Mercer smoothing and one a very simple linear interpolation technique, both of which outperform existing methods.
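The core technique under comparison can be illustrated with a small sketch: Jelinek-Mercer smoothing linearly interpolates the maximum-likelihood bigram estimate with a unigram fallback, and model quality is measured as the cross-entropy (bits per token) of test data. The function names, the fixed weight `lam`, and the toy corpus below are illustrative assumptions, not the paper's implementation; in the paper the interpolation weights are estimated from held-out data rather than fixed by hand.

```python
from collections import Counter
import math

def train_jm_bigram(tokens, lam=0.7):
    """Return a Jelinek-Mercer-smoothed bigram probability function.

    lam weights the bigram MLE; (1 - lam) weights the unigram fallback.
    (A fixed lam is a simplification: the methods studied in the paper
    estimate such weights from held-out data.)
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    def prob(w_prev, w):
        p_uni = unigrams[w] / total
        p_bi = bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0
        return lam * p_bi + (1 - lam) * p_uni

    return prob

def cross_entropy(prob, test_tokens):
    """Average negative log2 probability per predicted test token."""
    bits = 0.0
    n = 0
    for w_prev, w in zip(test_tokens, test_tokens[1:]):
        bits += -math.log2(prob(w_prev, w))
        n += 1
    return bits / n

# Toy example: lower cross-entropy means the model predicts the test data better.
train = "the cat sat on the mat the cat ate".split()
test = "the cat sat on the mat".split()
model = train_jm_bigram(train)
print(round(cross_entropy(model, test), 3))
```

Because the unigram term is nonzero for every in-vocabulary word, the interpolated probability never vanishes, which is exactly what keeps the cross-entropy finite on unseen bigrams.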
REFERENCES

- Bahl, Lalit R., Frederick Jelinek, and Robert L. Mercer. 1983. A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(2):179--190, March.
- Brown, Peter F., John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Frederick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin. 1990. A statistical approach to machine translation. Computational Linguistics, 16(2):79--85, June.
- Brown, Peter F., Stephen A. Della Pietra, Vincent J. Della Pietra, Jennifer C. Lai, and Robert L. Mercer. 1992. An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1):31--40, March.
- Chen, Stanley F. 1996. Building Probabilistic Models for Natural Language. Ph.D. thesis, Harvard University. In preparation.
- Church, Kenneth. 1988. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the Second Conference on Applied Natural Language Processing, pages 136--143.
- Church, Kenneth W. and William A. Gale. 1991. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language, 5:19--54.
- Collins, Michael and James Brooks. 1995. Prepositional phrase attachment through a backed-off model. In David Yarowsky and Kenneth Church, editors, Proceedings of the Third Workshop on Very Large Corpora, pages 27--38, Cambridge, MA, June.
- Gale, William A. and Kenneth W. Church. 1990. Estimation procedures for language context: poor estimates are worse than none. In COMPSTAT, Proceedings in Computational Statistics, 9th Symposium, pages 69--74, Dubrovnik, Yugoslavia, September.
- Gale, William A. and Kenneth W. Church. 1994. What's wrong with adding one? In N. Oostdijk and P. de Haan, editors, Corpus-Based Research into Language. Rodopi, Amsterdam.
- Gale, William A. and Geoffrey Sampson. 1995. Good-Turing frequency estimation without tears. Journal of Quantitative Linguistics, 2(3). To appear.
- Good, I. J. 1953. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3 and 4):237--264.
- Jeffreys, H. 1948. Theory of Probability. Clarendon Press, Oxford, second edition.
- Jelinek, Frederick and Robert L. Mercer. 1980. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, Amsterdam, The Netherlands, May. North-Holland.
- Johnson, W. E. 1932. Probability: deductive and inductive problems. Mind, 41:421--423.
- Katz, Slava M. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-35(3):400--401, March.
- Kernighan, M. D., K. W. Church, and W. A. Gale. 1990. A spelling correction program based on a noisy channel model. In Proceedings of the Thirteenth International Conference on Computational Linguistics, pages 205--210.
- Lidstone, G. J. 1920. Note on the general case of the Bayes-Laplace formula for inductive or a posteriori probabilities. Transactions of the Faculty of Actuaries, 8:182--192.
- MacKay, David J. C. and Linda C. Peto. 1995. A hierarchical Dirichlet language model. Natural Language Engineering, 1(3):1--19.
- Magerman, David M. 1994. Natural Language Parsing as Statistical Pattern Recognition. Ph.D. thesis, Stanford University, February.
- Nádas, Arthur. 1984. Estimation of probabilities in the language model of the IBM speech recognition system. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-32(4):859--861, August.
- Press, W. H., B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. 1988. Numerical Recipes in C. Cambridge University Press, Cambridge.