DOI: 10.3115/981863.981904

An empirical study of smoothing techniques for language modeling

Published: 24 June 1996

ABSTRACT

We present an extensive empirical comparison of several smoothing techniques in the domain of language modeling, including those described by Jelinek and Mercer (1980), Katz (1987), and Church and Gale (1991). We investigate for the first time how factors such as training data size, corpus (e.g., Brown versus Wall Street Journal), and n-gram order (bigram versus trigram) affect the relative performance of these methods, which we measure through the cross-entropy of test data. In addition, we introduce two novel smoothing techniques, one a variation of Jelinek-Mercer smoothing and one a very simple linear interpolation technique, both of which outperform existing methods.
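To make the evaluation setup concrete, the following is a minimal sketch of a linearly interpolated bigram model scored by test-set cross-entropy. It is not the authors' implementation: the function names (`train`, `interp_prob`, `cross_entropy`), the fixed weights `lam` and `mu`, and the closed-vocabulary assumption are all illustrative choices. Jelinek-Mercer smoothing as usually described estimates its interpolation weights from held-out data, which this sketch omits.

```python
import math
from collections import Counter


def train(tokens):
    """Collect the counts needed for a bigram model from a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    # How often each word occurs as a bigram history (final token excluded).
    histories = Counter(tokens[:-1])
    return {"uni": unigrams, "bi": bigrams, "hist": histories,
            "total": len(tokens)}


def interp_prob(w, prev, model, vocab_size, lam=0.7, mu=0.9):
    """P(w | prev) as a fixed-weight mix of bigram, unigram, and uniform estimates.

    lam and mu are hypothetical fixed weights, not values from the paper.
    vocab_size is assumed to cover every word seen in training and test data.
    """
    p_uniform = 1.0 / vocab_size
    # Unigram estimate, itself smoothed toward uniform so unseen words get mass.
    p_uni = mu * model["uni"][w] / model["total"] + (1 - mu) * p_uniform
    h = model["hist"][prev]
    if h == 0:
        # Unseen history: fall back to the smoothed unigram estimate alone.
        return p_uni
    p_bi_ml = model["bi"][(prev, w)] / h
    return lam * p_bi_ml + (1 - lam) * p_uni


def cross_entropy(test_tokens, model, vocab_size, lam=0.7, mu=0.9):
    """Average negative log2 probability per predicted word (bits per word)."""
    pairs = list(zip(test_tokens, test_tokens[1:]))
    log_sum = sum(math.log2(interp_prob(w, prev, model, vocab_size, lam, mu))
                  for prev, w in pairs)
    return -log_sum / len(pairs)


if __name__ == "__main__":
    train_tokens = "the cat sat on the mat".split()
    test_tokens = "the cat sat on the dog".split()
    vocab = set(train_tokens) | set(test_tokens)
    model = train(train_tokens)
    print(cross_entropy(test_tokens, model, len(vocab)))
```

Lower cross-entropy on held-out text indicates a better model, so different smoothing methods can be compared by swapping out `interp_prob` while keeping `cross_entropy` fixed.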

References

  1. Bahl, Lalit R., Frederick Jelinek, and Robert L. Mercer. 1983. A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(2):179--190, March.
  2. Brown, Peter F., John Cocke, Stephen A. DellaPietra, Vincent J. DellaPietra, Frederick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin. 1990. A statistical approach to machine translation. Computational Linguistics, 16(2):79--85, June.
  3. Brown, Peter F., Stephen A. DellaPietra, Vincent J. DellaPietra, Jennifer C. Lai, and Robert L. Mercer. 1992. An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1):31--40, March.
  4. Chen, Stanley F. 1996. Building Probabilistic Models for Natural Language. Ph.D. thesis, Harvard University. In preparation.
  5. Church, Kenneth. 1988. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the Second Conference on Applied Natural Language Processing, pages 136--143.
  6. Church, Kenneth W. and William A. Gale. 1991. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language, 5:19--54.
  7. Collins, Michael and James Brooks. 1995. Prepositional phrase attachment through a backed-off model. In David Yarowsky and Kenneth Church, editors, Proceedings of the Third Workshop on Very Large Corpora, pages 27--38, Cambridge, MA, June.
  8. Gale, William A. and Kenneth W. Church. 1990. Estimation procedures for language context: poor estimates are worse than none. In COMPSTAT, Proceedings in Computational Statistics, 9th Symposium, pages 69--74, Dubrovnik, Yugoslavia, September.
  9. Gale, William A. and Kenneth W. Church. 1994. What's wrong with adding one? In N. Oostdijk and P. de Haan, editors, Corpus-Based Research into Language. Rodopi, Amsterdam.
  10. Gale, William A. and Geoffrey Sampson. 1995. Good-Turing frequency estimation without tears. Journal of Quantitative Linguistics, 2(3). To appear.
  11. Good, I. J. 1953. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3 and 4):237--264.
  12. Jeffreys, H. 1948. Theory of Probability. Clarendon Press, Oxford, second edition.
  13. Jelinek, Frederick and Robert L. Mercer. 1980. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, Amsterdam, The Netherlands: North-Holland, May.
  14. Johnson, W. E. 1932. Probability: deductive and inductive problems. Mind, 41:421--423.
  15. Katz, Slava M. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-35(3):400--401, March.
  16. Kernighan, M. D., K. W. Church, and W. A. Gale. 1990. A spelling correction program based on a noisy channel model. In Proceedings of the Thirteenth International Conference on Computational Linguistics, pages 205--210.
  17. Lidstone, G. J. 1920. Note on the general case of the Bayes-Laplace formula for inductive or a posteriori probabilities. Transactions of the Faculty of Actuaries, 8:182--192.
  18. MacKay, David J. C. and Linda C. Peto. 1995. A hierarchical Dirichlet language model. Natural Language Engineering, 1(3):1--19.
  19. Magerman, David M. 1994. Natural Language Parsing as Statistical Pattern Recognition. Ph.D. thesis, Stanford University, February.
  20. Nadas, Arthur. 1984. Estimation of probabilities in the language model of the IBM speech recognition system. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-32(4):859--861, August.
  21. Press, W. H., B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. 1988. Numerical Recipes in C. Cambridge University Press, Cambridge.
Published in

ACL '96: Proceedings of the 34th annual meeting on Association for Computational Linguistics
June 1996, 399 pages
Program Chairs: Aravind Joshi, Martha Palmer
Publisher: Association for Computational Linguistics, United States
Overall acceptance rate: 85 of 443 submissions, 19%
