DOI: 10.1145/2063576.2063871
poster

Adaptive term frequency normalization for BM25

Published: 24 October 2011

ABSTRACT

A key component of BM25 contributing to its success is its sublinear term frequency (TF) normalization formula. The scale and shape of this TF normalization component are controlled by a parameter k1, which is generally set to a term-independent constant. We hypothesize and show empirically that, to optimize retrieval performance, this parameter should be set in a term-specific way. Following this intuition, we propose an information gain measure to directly estimate the contributions of repeated term occurrences, which is then exploited to fit the BM25 function to predict a term-specific k1. Our experimental results show that the proposed approach, without needing any training data, can efficiently and automatically estimate a term-specific k1, and is more effective and robust than the standard BM25.
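
To make the role of k1 concrete, the following is a minimal Python sketch of the standard BM25 TF normalization component, not the estimation method proposed in the paper. The function names, the `TERM_K1` table, and every numeric k1 value in it are hypothetical and chosen only for illustration; the abstract does not give the information gain measure or any fitted values.

```python
def bm25_tf(tf, k1, doc_len=1.0, avg_doc_len=1.0, b=0.75):
    """Standard BM25 TF component: (k1 + 1) * tf / (k1 * norm + tf).

    norm is the usual length normalization 1 - b + b * (doc_len / avg_doc_len).
    """
    if tf <= 0:
        return 0.0
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return (k1 + 1.0) * tf / (k1 * norm + tf)


# ASSUMPTION: hypothetical per-term k1 values, purely illustrative.
# They are NOT taken from the paper and say nothing about which terms
# should receive larger or smaller k1.
TERM_K1 = {"retrieval": 0.6, "poisson": 2.0}
DEFAULT_K1 = 1.2  # a commonly used term-independent setting


def tf_weight(term, tf, **length_args):
    """Look up a term-specific k1 (falling back to the default) and apply it."""
    k1 = TERM_K1.get(term, DEFAULT_K1)
    return bm25_tf(tf, k1, **length_args)


if __name__ == "__main__":
    # Show how k1 changes the scale and shape of the TF curve.
    for k1 in (0.3, 1.2, 3.0):
        curve = [round(bm25_tf(tf, k1), 3) for tf in (1, 2, 5, 20)]
        print(f"k1={k1}: {curve}")
```

For a document of average length, the weight is 1 at tf = 1 for any k1 and approaches k1 + 1 as tf grows, so a small k1 saturates quickly while a large k1 keeps rewarding repeated occurrences. A term-specific k1 amounts to replacing the single constant with a per-term lookup, as `tf_weight` does here.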

    • Published in

      CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management
      October 2011
      2712 pages
      ISBN: 9781450307178
      DOI: 10.1145/2063576

      Copyright © 2011 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
