Empirical term weighting and expansion frequency

Authors:
Kyoji Umemura, Toyohashi University of Technology, Toyohashi Aichi, Japan
Kenneth W. Church, AT&T Labs-Research, Florham Park, NJ

EMNLP '00: Proceedings of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 13, October 2000, Pages 117–123, https://doi.org/10.3115/1117794.1117809

Published: 07 October 2000

ABSTRACT

We propose an empirical method for estimating term weights directly from relevance judgments, avoiding various standard but potentially troublesome assumptions. It is common to assume, for example, that weights vary with term frequency (tf) and inverse document frequency (idf) in a particular way, e.g., tf·idf, but the fact that there are so many variants of this formula in the literature suggests that there remains considerable uncertainty about these assumptions. Our method is similar to the Berkeley regression method, where labeled relevance judgments are fit as a linear combination of (transforms of) tf, idf, etc. Training methods not only improve performance, but also extend naturally to include additional factors such as burstiness and query expansion. The proposed histogram-based training method provides a simple way to model complicated interactions among factors such as tf, idf, burstiness and expansion frequency (a generalization of query expansion). Expanded terms are handled correctly on the basis of statistical information. Expansion frequency dramatically improves performance from a level comparable to BKJJBIDS, Berkeley's entry in the Japanese NACSIS NTCIR-1 evaluation for short queries, to the level of JCB1, the top system in the evaluation. JCB1 uses sophisticated (and proprietary) natural language processing techniques developed by Just System, a leader in the Japanese word-processing industry. We are encouraged that the proposed method, which is simple to understand and replicate, can reach this level of performance.
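
The abstract does not spell out the training procedure, so the following is only an illustrative sketch of what a histogram-based estimate of term weights from relevance judgments could look like. It assumes that each (query term, document) pair is placed in a bin keyed by a capped tf and an integer idf bucket, and that a bin's weight is the smoothed log-likelihood ratio of relevant to non-relevant pairs falling in it. The function names, the tf cap, the bucketing, and the add-0.5 smoothing are all assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter

TF_CAP = 3  # assumed cap so that rare high tf values share one bin


def idf_bucket(df, n_docs):
    """Integer idf bucket, floor(-log2(df / n_docs)) (an assumed binning)."""
    return int(math.floor(-math.log2(df / n_docs)))


def train_weights(examples, smoothing=0.5):
    """Estimate one weight per (tf, idf bucket) bin from labeled judgments.

    examples: iterable of (tf, idf_bucket, is_relevant) tuples, one per
    (query term, document) pair. Returns a dict mapping each bin to
    log2(P(bin | relevant) / P(bin | non-relevant)), with additive smoothing.
    """
    rel, nonrel = Counter(), Counter()
    for tf, idf, is_relevant in examples:
        key = (min(tf, TF_CAP), idf)
        (rel if is_relevant else nonrel)[key] += 1
    bins = set(rel) | set(nonrel)
    n_rel = sum(rel.values()) + smoothing * len(bins)
    n_non = sum(nonrel.values()) + smoothing * len(bins)
    return {
        b: math.log2(((rel[b] + smoothing) / n_rel) /
                     ((nonrel[b] + smoothing) / n_non))
        for b in bins
    }


def score(weights, term_features):
    """Score a document as the sum of bin weights over the query terms.

    term_features: list of (tf, idf_bucket) pairs, one per query term.
    Bins never seen in training contribute 0.
    """
    return sum(weights.get((min(tf, TF_CAP), idf), 0.0)
               for tf, idf in term_features)
```

Further factors mentioned in the abstract, such as burstiness or expansion frequency, could in principle be added as extra components of the bin key, at the cost of sparser histograms that need more training data or coarser bins.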

References

Chris Buckley, Gerard Salton, James Allan, and Amit Singhal. 1995. Automatic query expansion using SMART: TREC 3. In The Third Text REtrieval Conference (TREC-3), pages 69--80.
Aitao Chen, Fredric C. Gey, Kazuaki Kishida, Hailing Jiang, and Qun Liang. 1999. Comparing multiple methods for Japanese and Japanese-English text retrieval. In NTCIR Workshop 1, pages 49--58, Tokyo, Japan, Sep.
Kenneth W. Church and William A. Gale. 1995. Poisson mixtures. Natural Language Engineering, 1(2):163--190.
Kenneth W. Church. 2000. Empirical estimates of adaptation: The chance of two Noriegas is closer to p/2 than p². In Coling-2000, pages 180--186.
William S. Cooper, Aitao Chen, and Fredric C. Gey. 1994. Full text retrieval based on probabilistic equation with coefficients fitted by logistic regressions. In The Second Text REtrieval Conference (TREC-2), pages 57--66.
Sumio Fujita. 1999. Notes on phrasal indexing: JSCB evaluation experiments at NTCIR ad hoc. In NTCIR Workshop 1, pages 101--108, http://www.rd.nacsis.ac.jp/~ntcadm/, Sep.
Noriko Kando, Kazuko Kuriyama, Toshihiko Nozue, Koji Eguchi, Hiroyuki Kato, and Souichiro Hidaka. 1999. Overview of IR tasks at the first NTCIR workshop. In NTCIR Workshop 1, pages 11--44, http://www.rd.nacsis.ac.jp/~ntcadm/, Sep.
Slava M. Katz. 1996. Distribution of content words and phrases in text and language modelling. Natural Language Engineering, 2(1):15--59.
K. L. Kwok. 1996. A new method of weighting query terms for ad-hoc retrieval. In SIGIR96, pages 187--195, Zurich, Switzerland.
Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, Yoshitaka Hirano, Osamu Imaichi, and Tomoaki Imamura. 1997. Japanese morphological analysis system ChaSen manual. Technical Report NAIST-IS-TR97007, NAIST, Nara, Japan, Feb.

Published in
EMNLP '00: Proceedings of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 13
October 2000, 233 pages
Conference Chairs: Hinrich Schütze (GroupFire Inc), Keh-Yih Su (Behavior Design Corporation)
Publisher: Association for Computational Linguistics, United States
Qualifiers: Article
Overall Acceptance Rate: 73 of 234 submissions, 31%