skip to main content
10.1145/2063576.2063584acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article

Lower-bounding term frequency normalization

Authors Info & Claims
Published:24 October 2011Publication History

ABSTRACT

In this paper, we reveal a common deficiency of the current retrieval models: the component of term frequency (TF) normalization by document length is not lower-bounded properly; as a result, very long documents tend to be overly penalized. In order to analytically diagnose this problem, we propose two desirable formal constraints to capture the heuristic of lower-bounding TF, and use constraint analysis to examine several representative retrieval functions. Analysis results show that all these retrieval functions can only satisfy the constraints for a certain range of parameter values and/or for a particular set of query terms. Empirical results further show that the retrieval performance tends to be poor when the parameter is out of the range or the query term is not in the particular set. To solve this common problem, we propose a general and efficient method to introduce a sufficiently large lower bound for TF normalization which can be shown analytically to fix or alleviate the problem. Our experimental results demonstrate that the proposed method, incurring almost no additional computational cost, can be applied to state-of-the-art retrieval functions, such as Okapi BM25, language models, and the divergence from randomness approach, to significantly improve the average precision, especially for verbose queries.

References

  1. G. Amati and C. J. Van Rijsbergen. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst., 20:357--389, October 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. S. Clinchant and E. Gaussier. Information-based models for ad hoc ir. In SIGIR '10, pages 234--241, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. R. Cummins and C. O'Riordan. An axiomatic comparison of learned term-weighting schemes in information retrieval: clarifications and extensions. Artif. Intell. Rev., 28:51--68, June 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. H. Fang, T. Tao, and C. Zhai. A formal study of information retrieval heuristics. In SIGIR '04, pages 49--56, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. H. Fang, T. Tao, and C. Zhai. Diagnostic evaluation of information retrieval models. ACM Trans. Inf. Syst., 29:7:1--7:42, April 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. H. Fang and C. Zhai. An exploration of axiomatic approaches to information retrieval. In SIGIR '05, pages 480--487, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. H. Fang and C. Zhai. Semantic term matching in axiomatic approaches to information retrieval. In SIGIR '06, pages 115--122, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. N. Fuhr. Probabilistic models in information retrieval. The Computer Journal, 35:243--255, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. H. P. Luhn. A statistical approach to mechanized encoding and searching of literary information. IBM J. Res. Dev., 1:309--317, October 1957. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Y. Lv and C. Zhai. Adaptive term frequency normalization for bm25. In CIKM '11, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Y. Lv and C. Zhai. When documents are very long, bm25 fails! In SIGIR '11, pages 1103--1104, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In SIGIR '98, pages 275--281, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. S. E. Robertson and K. S. Jones. Relevance weighting of search terms. Journal of the American Society of Information Science, 27(3):129--146, 1976.Google ScholarGoogle ScholarCross RefCross Ref
  14. S. E. Robertson and S. Walker. Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In SIGIR '94, pages 232--241, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. S. E. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford. Okapi at trec-3. In TREC '94, pages 109--126, 1994.Google ScholarGoogle Scholar
  16. G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Commun. ACM, 18(11):613--620, 1975. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. A. Singhal. Modern information retrieval: a brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24:2001, 2001.Google ScholarGoogle Scholar
  18. A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. In SIGIR '96, pages 21--29, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. T. Tao and C. Zhai. An exploration of proximity measures in information retrieval. In SIGIR '07, pages 295--302, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. C. Zhai and J. D. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In SIGIR '01, pages 334--342, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Lower-bounding term frequency normalization

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management
      October 2011
      2712 pages
      ISBN:9781450307178
      DOI:10.1145/2063576

      Copyright © 2011 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 24 October 2011

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate1,861of8,427submissions,22%

      Upcoming Conference

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader