poster

Adaptive term frequency normalization for BM25

Authors:
Yuanhua Lv, University of Illinois at Urbana-Champaign, Urbana, IL, USA
ChengXiang Zhai, University of Illinois at Urbana-Champaign, Urbana, IL, USA

CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management, October 2011, Pages 1985–1988. https://doi.org/10.1145/2063576.2063871

Published: 24 October 2011

ABSTRACT

A key component of BM25 contributing to its success is its sublinear term frequency (TF) normalization formula. The scale and shape of this TF normalization component are controlled by a parameter k1, which is generally set to a term-independent constant. We hypothesize and show empirically that, in order to optimize retrieval performance, this parameter should be set in a term-specific way. Following this intuition, we propose an information gain measure to directly estimate the contributions of repeated term occurrences, which is then exploited to fit the BM25 function and predict a term-specific k1. Our experimental results show that the proposed approach, without needing any training data, can efficiently and automatically estimate a term-specific k1, and is more effective and robust than the standard BM25.
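To make the role of k1 concrete, the following is a minimal Python sketch of the standard BM25 TF normalization component, with k1 looked up per term rather than fixed globally. It is an illustration only: the abstract does not give the information gain estimator, so the `TERM_K1` table, its values, and the helper names `bm25_tf` and `term_tf_weight` are hypothetical placeholders, not the authors' method.

```python
def bm25_tf(tf, k1=1.2, b=0.75, doc_len=1.0, avg_doc_len=1.0):
    """Standard BM25 TF component: sublinear in tf, bounded above by k1 + 1."""
    if tf <= 0:
        return 0.0
    # Document length normalization scales k1 by the relative document length.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + norm)


# Hypothetical per-term k1 values (assumption for illustration only).
# A smaller k1 means repeated occurrences saturate faster.
TERM_K1 = {"retrieval": 0.6, "model": 2.0}
DEFAULT_K1 = 1.2  # common term-independent default


def term_tf_weight(term, tf, doc_len, avg_doc_len, b=0.75):
    """TF weight for one term, using a term-specific k1 when one is available."""
    k1 = TERM_K1.get(term, DEFAULT_K1)
    return bm25_tf(tf, k1=k1, b=b, doc_len=doc_len, avg_doc_len=avg_doc_len)


if __name__ == "__main__":
    for term in ("retrieval", "model", "unseen"):
        weights = [round(term_tf_weight(term, tf, 100, 100), 3) for tf in (1, 2, 5, 20)]
        print(term, weights)
```

For a document of average length, every term gets weight 1 at tf = 1 regardless of k1; what k1 changes is how quickly additional occurrences approach the ceiling of k1 + 1, which is the behavior the abstract proposes to tune per term.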


Index Terms

Information systems > Information retrieval > Retrieval models and ranking

Published in
CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management
October 2011, 2712 pages
ISBN: 9781450307178
DOI: 10.1145/2063576
Editors: Bettina Berendt; Arjen de Vries; Wenfei Fan; Craig Macdonald (University of Glasgow, UK); Iadh Ounis (University of Glasgow, UK); Ian Ruthven (University of Strathclyde, UK)
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
adaptation
bm25
information gain
term frequency
Qualifiers
- poster