DOI: 10.3115/1117794.1117809

Empirical term weighting and expansion frequency

Published: 7 October 2000

ABSTRACT

We propose an empirical method for estimating term weights directly from relevance judgments, avoiding various standard but potentially troublesome assumptions. It is common to assume, for example, that weights vary with term frequency (tf) and inverse document frequency (idf) in a particular way, e.g., tf·idf, but the many variants of this formula in the literature suggest that considerable uncertainty remains about these assumptions. Our method is similar to the Berkeley regression method, in which labeled relevance judgments are fit as a linear combination of (transforms of) tf, idf, etc. Training methods not only improve performance but also extend naturally to additional factors such as burstiness and query expansion. The proposed histogram-based training method provides a simple way to model complicated interactions among factors such as tf, idf, burstiness, and expansion frequency (a generalization of query expansion). Expanded terms are handled on the basis of statistical information. Expansion frequency dramatically improves performance from a level comparable to BKJJBIDS, Berkeley's entry in the Japanese NACSIS NTCIR-1 evaluation for short queries, to the level of JCB1, the top system in the evaluation. JCB1 uses sophisticated (and proprietary) natural language processing techniques developed by Just System, a leader in the Japanese word-processing industry. We are encouraged that the proposed method, which is simple to understand and replicate, can reach this level of performance.
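To make the idea of histogram-based training concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation. The abstract does not specify the binning scheme or the estimator, so the choices below are hypothetical: tf is capped at a small maximum, idf is binned by its integer part, and each bin's weight is the smoothed log ratio of how often that bin occurs among relevant versus non-relevant judgments. The function names (`bin_key`, `train_histogram_weights`, `score_document`) are invented for this example.

```python
import math
from collections import Counter


def bin_key(tf, idf, max_tf=3):
    """Map a (tf, idf) pair to a histogram bin.

    Assumption for illustration: tf is capped at max_tf and idf is
    binned by its integer part. The abstract does not give the binning.
    """
    return (min(tf, max_tf), math.floor(idf))


def train_histogram_weights(examples, smoothing=1.0):
    """Estimate one weight per bin from labeled relevance judgments.

    examples: iterable of (tf, idf, relevant) tuples, one per
    (query term, document) pair, where relevant is a bool.
    Returns a dict mapping bin -> log2 ratio of smoothed bin
    probabilities under relevant vs. non-relevant judgments.
    """
    rel, irrel = Counter(), Counter()
    for tf, idf, relevant in examples:
        key = bin_key(tf, idf)
        if relevant:
            rel[key] += 1
        else:
            irrel[key] += 1

    bins = set(rel) | set(irrel)
    n_rel = sum(rel.values())
    n_irrel = sum(irrel.values())
    k = len(bins)

    weights = {}
    for key in bins:
        # Additive smoothing keeps empty bins from producing log(0).
        p_rel = (rel[key] + smoothing) / (n_rel + smoothing * k)
        p_irrel = (irrel[key] + smoothing) / (n_irrel + smoothing * k)
        weights[key] = math.log2(p_rel / p_irrel)
    return weights


def score_document(term_stats, weights):
    """Sum the learned bin weights over a document's query terms.

    term_stats: iterable of (tf, idf) pairs for the query terms.
    Bins never seen in training contribute 0.0.
    """
    return sum(weights.get(bin_key(tf, idf), 0.0) for tf, idf in term_stats)


if __name__ == "__main__":
    # Toy, made-up judgments purely to exercise the code.
    toy = [
        (2, 5.3, True), (2, 5.1, True), (2, 5.8, True), (1, 1.2, True),
        (1, 1.2, False), (1, 1.7, False), (1, 1.0, False), (2, 5.0, False),
    ]
    w = train_histogram_weights(toy)
    print(w)
    print(score_document([(2, 5.4)], w))
```

Because weights are looked up per bin rather than computed from a fixed formula, further factors such as burstiness or expansion frequency could be accommodated in a sketch like this by adding dimensions to the bin key.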

Published in

EMNLP '00: Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics, Volume 13
October 2000, 233 pages

Publisher: Association for Computational Linguistics, United States

Overall acceptance rate: 73 of 234 submissions, 31%