ABSTRACT
We propose an empirical method for estimating term weights directly from relevance judgments, avoiding various standard but potentially troublesome assumptions. It is common to assume, for example, that weights vary with term frequency (tf) and inverse document frequency (idf) in a particular way, e.g., tf·idf, but the many variants of this formula in the literature suggest that considerable uncertainty remains about these assumptions. Our method is similar to the Berkeley regression method, in which labeled relevance judgments are fit as a linear combination of (transforms of) tf, idf, etc. Training methods not only improve performance, but also extend naturally to include additional factors such as burstiness and query expansion. The proposed histogram-based training method provides a simple way to model complicated interactions among factors such as tf, idf, burstiness, and expansion frequency (a generalization of query expansion). Expanded terms are handled correctly on the basis of statistical information. Expansion frequency dramatically improves performance from a level comparable to BKJJBIDS, Berkeley's entry in the Japanese NACSIS NTCIR-1 evaluation for short queries, to the level of JCB1, the top system in the evaluation. JCB1 uses sophisticated (and proprietary) natural language processing techniques developed by Just System, a leader in the Japanese word-processing industry. We are encouraged that the proposed method, which is simple to understand and replicate, can reach this level of performance.
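As a rough illustration of what a histogram-based estimate of term weights could look like, the sketch below bins (query term, document) observations by tf and a coarse idf bucket, then sets each bin's weight to the log ratio of its smoothed relative frequency among relevant versus non-relevant judgments. This is a minimal sketch under stated assumptions, not the procedure from the paper: the function names, the idf bucketing, the tf cap, and the add-one style smoothing are all hypothetical choices made for the example, and burstiness and expansion frequency are left out.

```python
import math
from collections import Counter

def idf_bin(df, n_docs):
    """Hypothetical bucketing: integer part of -log2(df / N). Requires df > 0."""
    return int(-math.log2(df / n_docs))

def train_weights(examples, n_docs, smoothing=1.0, max_tf=3):
    """Estimate one weight per (tf, idf bin) histogram cell.

    examples: iterable of (tf, df, relevant) tuples, one per
    (query term, judged document) pair. All names are illustrative.
    """
    rel, non = Counter(), Counter()
    for tf, df, relevant in examples:
        key = (min(tf, max_tf), idf_bin(df, n_docs))  # cap tf to limit sparsity
        (rel if relevant else non)[key] += 1

    bins = set(rel) | set(non)
    n_rel, n_non, k = sum(rel.values()), sum(non.values()), len(bins)

    weights = {}
    for key in bins:
        # Smoothed relative frequency of this cell in each class.
        p_rel = (rel[key] + smoothing) / (n_rel + smoothing * k)
        p_non = (non[key] + smoothing) / (n_non + smoothing * k)
        weights[key] = math.log2(p_rel / p_non)
    return weights

# Toy judgments (fabricated for illustration only, not data from the paper).
toy = ([(2, 10, True)] * 8 + [(2, 10, False)] * 2
       + [(1, 500, True)] * 2 + [(1, 500, False)] * 8)
weights = train_weights(toy, n_docs=1000)
```

In this toy setup, the cell for rare terms occurring twice receives a positive weight and the cell for common terms occurring once receives a negative one. Adding a factor would amount to adding one more component to the bin key, which is the sense in which a histogram can capture interactions without assuming a functional form.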
- Chris Buckley, Gerard Salton, James Allan, and Amit Singhal. 1995. Automatic query expansion using SMART: TREC 3. In The Third Text REtrieval Conference (TREC-3), pages 69--80.
- Aitao Chen, Fredric C. Gey, Kazuaki Kishida, Hailing Jiang, and Qun Liang. 1999. Comparing multiple methods for Japanese and Japanese-English text retrieval. In NTCIR Workshop 1, pages 49--58, Tokyo, Japan, Sep.
- Kenneth W. Church and William A. Gale. 1995. Poisson mixtures. Natural Language Engineering, 1(2):163--190.
- Kenneth W. Church. 2000. Empirical estimates of adaptation: The chance of two Noriegas is closer to p/2 than p^2. In Coling-2000, pages 180--186.
- William S. Cooper, Aitao Chen, and Fredric C. Gey. 1994. Full text retrieval based on probabilistic equation with coefficients fitted by logistic regressions. In The Second Text REtrieval Conference (TREC-2), pages 57--66.
- Sumio Fujita. 1999. Notes on phrasal indexing: JSCB evaluation experiments at NTCIR ad hoc. In NTCIR Workshop 1, pages 101--108, http://www.rd.nacsis.ac.jp/~ntcadm/, Sep.
- Noriko Kando, Kazuko Kuriyama, Toshihiko Nozue, Koji Eguchi, Hiroyuki Kato, and Souichiro Hidaka. 1999. Overview of IR tasks at the first NTCIR workshop. In NTCIR Workshop 1, pages 11--44, http://www.rd.nacsis.ac.jp/~ntcadm/, Sep.
- Slava M. Katz. 1996. Distribution of content words and phrases in text and language modelling. Natural Language Engineering, 2(1):15--59.
- K. L. Kwok. 1996. A new method of weighting query terms for ad-hoc retrieval. In SIGIR96, pages 187--195, Zurich, Switzerland.
- Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, Yoshitaka Hirano, Osamu Imaichi, and Tomoaki Imamura. 1997. Japanese morphological analysis system ChaSen manual. Technical Report NAIST-IS-TR97007, NAIST, Nara, Japan, Feb.