DOI: 10.1145/1645953.1646237
poster

A machine learning approach for improved BM25 retrieval

Published: 02 November 2009

ABSTRACT

Despite the widespread use of BM25, few studies have examined its effectiveness on document descriptions built from single fields and from combinations of multiple fields. We determine the effectiveness of BM25 on various document fields. We find that BM25 models relevance on popularity fields, such as anchor text and query click information, no better than a linear function of the field attributes. We also find that query click information is the single most important field for retrieval. In response, we develop a machine learning approach to BM25-style retrieval that uses LambdaRank to learn from the input attributes of BM25. Our model significantly improves retrieval effectiveness over BM25 and BM25F. Our data-driven approach is fast and effective, avoids the problem of parameter tuning, and can directly optimize for several common information retrieval measures. We demonstrate the advantages of our model on a very large real-world Web data collection.
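For readers unfamiliar with the baseline, the following is a minimal sketch of classic single-field BM25 scoring, written to make visible the input attributes (term frequency, document frequency, field length) that the abstract says the learned model consumes. It is not the authors' implementation. The function name `bm25_score`, the toy corpus, the parameter defaults `k1=1.2` and `b=0.75`, and the smoothed IDF variant are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_len, k1=1.2, b=0.75):
    """Score one document field against a query with classic BM25.

    k1 and b are the free parameters that normally need tuning; the
    defaults here are common illustrative values, not values from the paper.
    """
    tf = Counter(doc_terms)          # term frequency within this field
    doc_len = len(doc_terms)         # field length in tokens
    score = 0.0
    for term in query_terms:
        f = tf.get(term, 0)
        if f == 0:
            continue                 # absent terms contribute nothing
        df = doc_freq.get(term, 0)   # number of documents containing the term
        # Smoothed IDF variant (assumption): adding 1 inside the log keeps it positive.
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)
        # Saturating term-frequency component with length normalization.
        norm = f + k1 * (1.0 - b + b * doc_len / avg_len)
        score += idf * f * (k1 + 1.0) / norm
    return score


# Hypothetical toy corpus, one token list per document.
docs = [
    "bm25 ranking function for retrieval".split(),
    "learning to rank with neural networks".split(),
    "click data for web retrieval ranking".split(),
]
doc_freq = Counter(term for d in docs for term in set(d))
avg_len = sum(len(d) for d in docs) / len(docs)

query = ["bm25", "retrieval"]
scores = [bm25_score(query, d, doc_freq, len(docs), avg_len) for d in docs]
```

In this sketch the quantities `f`, `df`, `doc_len`, and `avg_len` are combined by a fixed formula governed by `k1` and `b`. A learned ranker in the spirit of the abstract would instead take such attributes as features and fit the combination from data, which is how it sidesteps manual parameter tuning.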

Published in

CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management
November 2009
2162 pages
ISBN: 9781605585123
DOI: 10.1145/1645953

Copyright © 2009 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
