ABSTRACT
A key component of BM25 contributing to its success is its sublinear term frequency (TF) normalization formula. The scale and shape of this TF normalization component are controlled by a parameter k1, which is generally set to a term-independent constant. We hypothesize and show empirically that, in order to optimize retrieval performance, this parameter should be set in a term-specific way. Following this intuition, we propose an information gain measure to directly estimate the contributions of repeated term occurrences, which is then exploited to fit the BM25 function and predict a term-specific k1. Our experimental results show that the proposed approach, without needing any training data, can efficiently and automatically estimate a term-specific k1, and is more effective and robust than the standard BM25.
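To make the role of k1 concrete, the following is a minimal Python sketch of the widely used BM25 TF normalization component, tf·(k1+1) / (tf + k1·(1 − b + b·dl/avdl)), together with a scorer that looks up k1 per term. It does not implement the information gain estimator proposed in the paper; the function names, the `k1_by_term` mapping, and the default values k1 = 1.2 and b = 0.75 are illustrative assumptions.

```python
from typing import Dict, Iterable


def bm25_tf(tf: float, k1: float, b: float = 0.75,
            doc_len: float = 100.0, avg_doc_len: float = 100.0) -> float:
    """Sublinear BM25 TF component; saturates at k1 + 1 as tf grows."""
    if tf <= 0:
        return 0.0
    # Document length normalization factor (equals 1 for an average-length doc).
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * norm)


def bm25_score(query_terms: Iterable[str],
               doc_tf: Dict[str, int],
               idf: Dict[str, float],
               k1_by_term: Dict[str, float],
               default_k1: float = 1.2,
               b: float = 0.75,
               doc_len: float = 100.0,
               avg_doc_len: float = 100.0) -> float:
    """Sum of idf * TF component, with k1 looked up per term.

    k1_by_term is a hypothetical mapping; how its values are estimated
    is outside the scope of this sketch.
    """
    score = 0.0
    for term in query_terms:
        k1 = k1_by_term.get(term, default_k1)
        tf_part = bm25_tf(doc_tf.get(term, 0), k1, b, doc_len, avg_doc_len)
        score += idf.get(term, 0.0) * tf_part
    return score


if __name__ == "__main__":
    # Compare how quickly the TF component approaches its bound k1 + 1.
    for k1 in (0.5, 1.2, 2.0):
        fractions = [bm25_tf(tf, k1) / (k1 + 1.0) for tf in (1, 2, 5, 10)]
        print(k1, [round(f, 3) for f in fractions])
```

A smaller k1 makes the component approach its upper bound after only a few occurrences, while a larger k1 lets repeated occurrences keep contributing. This is the behavior a term-specific k1 would tune for each term separately.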
Index Terms
- Adaptive term frequency normalization for BM25