ABSTRACT
Despite the widespread use of BM25, there have been few studies examining its effectiveness on a document description over single and multiple field combinations. We determine the effectiveness of BM25 on various document fields. We find that BM25 models relevance on popularity fields such as anchor text and query click information no better than a linear function of the field attributes. We also find query click information to be the single most important field for retrieval. In response, we develop a machine learning approach to BM25-style retrieval that learns, using LambdaRank, from the input attributes of BM25. Our model significantly improves retrieval effectiveness over BM25 and BM25F. Our data-driven approach is fast, effective, avoids the problem of parameter tuning, and can directly optimize for several common information retrieval measures. We demonstrate the advantages of our model on a very large real-world Web data collection.
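For orientation, the following is a minimal sketch of textbook BM25 scoring over a single document field. It is not the learned model described above. It only illustrates the quantities a BM25-style scorer consumes (term frequency, document frequency, and field length), which are presumably the "input attributes" the abstract refers to. The function name `bm25_score`, the smoothed IDF variant, and the defaults `k1=1.2` and `b=0.75` are illustrative assumptions, not values taken from the paper.

```python
import math
from collections import Counter


def bm25_score(query_terms, field_terms, doc_freq, num_docs,
               avg_field_len, k1=1.2, b=0.75):
    """Score one field of one document against a query (textbook BM25).

    doc_freq maps a term to the number of documents whose field contains it.
    k1 and b are the usual free parameters; the defaults are common
    illustrative choices, not settings from the paper.
    """
    tf = Counter(field_terms)
    field_len = len(field_terms)
    # Length normalization: 1.0 when the field has average length.
    norm = 1.0 - b + b * field_len / avg_field_len
    score = 0.0
    for term in query_terms:
        f = tf.get(term, 0)
        if f == 0:
            continue  # terms absent from the field contribute nothing
        df = doc_freq.get(term, 0)
        # Smoothed IDF (assumed variant); the +1 keeps it non-negative.
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        # Saturating term-frequency component.
        score += idf * f * (k1 + 1.0) / (f + k1 * norm)
    return score


# Hypothetical toy statistics for a "title" field.
doc_freq = {"bm25": 2, "retrieval": 5}
print(bm25_score(["bm25", "retrieval"],
                 ["improved", "bm25", "retrieval"],
                 doc_freq, num_docs=10, avg_field_len=3.0))
```

The hand-tuned parameters `k1` and `b` are what a data-driven ranker would avoid: instead of fixing this functional form, a learned model can take the same per-field statistics as features and fit the combination directly.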
Index Terms
- A machine learning approach for improved BM25 retrieval