DOI: 10.1145/1031171.1031181
Article

Simple BM25 extension to multiple weighted fields

Published: 13 November 2004

ABSTRACT

This paper describes a simple way of adapting the BM25 ranking formula to deal with structured documents. In the past it has been common to compute scores for the individual fields (e.g. title and body) independently and then combine these scores (typically linearly) to arrive at a final score for the document. We highlight how this approach can lead to poor performance by breaking the carefully constructed non-linear saturation of term frequency in the BM25 function. We propose a much more intuitive alternative which weights term frequencies before the non-linear term frequency saturation function is applied. In this scheme, a structured document with a title weight of two is mapped to an unstructured document with the title content repeated twice. This more verbose unstructured document is then ranked in the usual way. We demonstrate the advantages of this method with experiments on Reuters Vol1 and the TREC dotGov collection.
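
The contrast drawn in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the value of `K1`, the omission of IDF and document length normalization, whitespace tokenization, and all function names are assumptions made only to keep the example short. It compares a linear combination of per-field scores with weighting term frequencies before saturation, and checks that the latter matches scoring a flat document in which each field is repeated according to its weight.

```python
from collections import Counter

K1 = 1.2  # assumed saturation parameter; not taken from the paper


def saturate(tf, k1=K1):
    """BM25-style term frequency saturation (IDF and length normalization omitted)."""
    return tf * (k1 + 1) / (tf + k1)


def linear_combination_score(doc, weights, query_terms):
    """Saturate each field's term frequency separately, then combine linearly."""
    score = 0.0
    for field, text in doc.items():
        counts = Counter(text.split())
        for term in query_terms:
            score += weights[field] * saturate(counts[term])
    return score


def weighted_tf_score(doc, weights, query_terms):
    """Combine weighted term frequencies across fields, then saturate once."""
    score = 0.0
    for term in query_terms:
        combined_tf = sum(
            weights[field] * Counter(text.split())[term]
            for field, text in doc.items()
        )
        score += saturate(combined_tf)
    return score


def repeat_fields(doc, weights):
    """Map a structured document to flat text, repeating each field weight times."""
    return " ".join(" ".join([text] * weights[field]) for field, text in doc.items())


def flat_score(text, query_terms):
    """Score an unstructured document in the usual way (single saturation)."""
    counts = Counter(text.split())
    return sum(saturate(counts[term]) for term in query_terms)


if __name__ == "__main__":
    doc = {"title": "bm25 ranking", "body": "bm25 scores saturate with frequency"}
    weights = {"title": 2, "body": 1}  # integer weights so fields can be repeated
    query = ["bm25"]

    print("linear combination:", linear_combination_score(doc, weights, query))
    print("weighted tf       :", weighted_tf_score(doc, weights, query))
    print("repeated flat doc :", flat_score(repeat_fields(doc, weights), query))
```

In this toy setting the per-term score of the weighted-frequency variant stays below the saturation ceiling of `K1 + 1`, while the linear combination can exceed it because each field contributes its own saturated score.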

    • Published in

      CIKM '04: Proceedings of the thirteenth ACM international conference on Information and knowledge management
      November 2004
      678 pages
      ISBN: 1581138741
      DOI: 10.1145/1031171

      Copyright © 2004 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
