ABSTRACT
In this paper, we reveal a common deficiency of the current retrieval models: the component of term frequency (TF) normalization by document length is not lower-bounded properly; as a result, very long documents tend to be overly penalized. In order to analytically diagnose this problem, we propose two desirable formal constraints to capture the heuristic of lower-bounding TF, and use constraint analysis to examine several representative retrieval functions. Analysis results show that all these retrieval functions can only satisfy the constraints for a certain range of parameter values and/or for a particular set of query terms. Empirical results further show that the retrieval performance tends to be poor when the parameter is out of the range or the query term is not in the particular set. To solve this common problem, we propose a general and efficient method to introduce a sufficiently large lower bound for TF normalization which can be shown analytically to fix or alleviate the problem. Our experimental results demonstrate that the proposed method, incurring almost no additional computational cost, can be applied to state-of-the-art retrieval functions, such as Okapi BM25, language models, and the divergence from randomness approach, to significantly improve the average precision, especially for verbose queries.
- G. Amati and C. J. Van Rijsbergen. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst., 20:357--389, October 2002. Google ScholarDigital Library
- S. Clinchant and E. Gaussier. Information-based models for ad hoc ir. In SIGIR '10, pages 234--241, 2010. Google ScholarDigital Library
- R. Cummins and C. O'Riordan. An axiomatic comparison of learned term-weighting schemes in information retrieval: clarifications and extensions. Artif. Intell. Rev., 28:51--68, June 2007. Google ScholarDigital Library
- H. Fang, T. Tao, and C. Zhai. A formal study of information retrieval heuristics. In SIGIR '04, pages 49--56, 2004. Google ScholarDigital Library
- H. Fang, T. Tao, and C. Zhai. Diagnostic evaluation of information retrieval models. ACM Trans. Inf. Syst., 29:7:1--7:42, April 2011. Google ScholarDigital Library
- H. Fang and C. Zhai. An exploration of axiomatic approaches to information retrieval. In SIGIR '05, pages 480--487, 2005. Google ScholarDigital Library
- H. Fang and C. Zhai. Semantic term matching in axiomatic approaches to information retrieval. In SIGIR '06, pages 115--122, 2006. Google ScholarDigital Library
- N. Fuhr. Probabilistic models in information retrieval. The Computer Journal, 35:243--255, 1992. Google ScholarDigital Library
- H. P. Luhn. A statistical approach to mechanized encoding and searching of literary information. IBM J. Res. Dev., 1:309--317, October 1957. Google ScholarDigital Library
- Y. Lv and C. Zhai. Adaptive term frequency normalization for bm25. In CIKM '11, 2011. Google ScholarDigital Library
- Y. Lv and C. Zhai. When documents are very long, bm25 fails! In SIGIR '11, pages 1103--1104, 2011. Google ScholarDigital Library
- J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In SIGIR '98, pages 275--281, 1998. Google ScholarDigital Library
- S. E. Robertson and K. S. Jones. Relevance weighting of search terms. Journal of the American Society of Information Science, 27(3):129--146, 1976.Google ScholarCross Ref
- S. E. Robertson and S. Walker. Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In SIGIR '94, pages 232--241, 1994. Google ScholarDigital Library
- S. E. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford. Okapi at trec-3. In TREC '94, pages 109--126, 1994.Google Scholar
- G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Commun. ACM, 18(11):613--620, 1975. Google ScholarDigital Library
- A. Singhal. Modern information retrieval: a brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24:2001, 2001.Google Scholar
- A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. In SIGIR '96, pages 21--29, 1996. Google ScholarDigital Library
- T. Tao and C. Zhai. An exploration of proximity measures in information retrieval. In SIGIR '07, pages 295--302, 2007. Google ScholarDigital Library
- C. Zhai and J. D. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In SIGIR '01, pages 334--342, 2001. Google ScholarDigital Library
Index Terms
- Lower-bounding term frequency normalization
Recommendations
Adaptive term frequency normalization for BM25
CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge managementA key component of BM25 contributing to its success is its sub linear term frequency (TF) normalization formula. The scale and shape of this TF normalization component is controlled by a parameter k1, which is generally set to a term-independent ...
A study of parameter tuning for term frequency normalization
CIKM '03: Proceedings of the twelfth international conference on Information and knowledge managementMost current term frequency normalization approaches for information retrieval involve the use of parameters. The tuning of these parameters has an important impact on the overall performance of the information retrieval system. Indeed, a small ...
Term frequency – function of document frequency: a new term weighting scheme for enterprise information retrieval
In today's business environment, enterprises are increasingly under pressure to process the vast amount of data produced everyday within enterprises. One method is to focus on the business intelligence BI applications and increasing the commercial added-...
Comments