poster

A machine learning approach for improved BM25 retrieval

Authors:
Krysta M. Svore, Microsoft Research, Redmond, WA, USA
Christopher J.C. Burges, Microsoft Research, Redmond, WA, USA

CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management, November 2009, Pages 1811–1814
https://doi.org/10.1145/1645953.1646237

Published: 02 November 2009

ABSTRACT

Despite the widespread use of BM25, there have been few studies examining its effectiveness on a document description over single and multiple field combinations. We determine the effectiveness of BM25 on various document fields. We find that BM25 models relevance on popularity fields such as anchor text and query click information no better than a linear function of the field attributes. We also find query click information to be the single most important field for retrieval. In response, we develop a machine learning approach to BM25-style retrieval that learns, using LambdaRank, from the input attributes of BM25. Our model significantly improves retrieval effectiveness over BM25 and BM25F. Our data-driven approach is fast, effective, avoids the problem of parameter tuning, and can directly optimize for several common information retrieval measures. We demonstrate the advantages of our model on a very large real-world Web data collection.
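
For readers unfamiliar with the baseline, the following is a minimal sketch of the classic BM25 scoring function and of the raw per-term attributes it consumes (term frequency, document frequency, field length). It is not the authors' implementation: the function names, the toy data, and the defaults `k1=1.2` and `b=0.75` are illustrative assumptions, and the abstract does not specify the exact feature set fed to LambdaRank.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_len, k1=1.2, b=0.75):
    """Classic BM25 score of one document (or field) for a query.

    k1 and b are commonly used defaults, assumed here for illustration.
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        f = tf.get(term, 0)
        if f == 0:
            continue  # terms absent from the document contribute nothing
        df = doc_freq.get(term, 0)
        # Robertson/Sparck Jones inverse document frequency with 0.5 smoothing
        idf = math.log((num_docs - df + 0.5) / (df + 0.5))
        # Saturating term frequency with length normalization
        norm = f + k1 * (1.0 - b + b * doc_len / avg_len)
        score += idf * f * (k1 + 1.0) / norm
    return score


def bm25_input_attributes(query_terms, doc_terms, doc_freq):
    """Hypothetical helper: the raw inputs of BM25 for each query term,
    as (term frequency, document frequency, field length) tuples."""
    tf = Counter(doc_terms)
    return [(tf.get(t, 0), doc_freq.get(t, 0), len(doc_terms)) for t in query_terms]


# Toy usage with made-up collection statistics
doc = "learning to rank with bm25".split()
doc_freq = {"bm25": 10, "rank": 50}
print(bm25_score(["bm25", "rank"], doc, doc_freq, num_docs=1000, avg_len=5.0))
print(bm25_input_attributes(["bm25", "rank"], doc, doc_freq))
```

BM25 combines these attributes through a fixed functional form governed by `k1` and `b`. A learned ranker in the spirit of the abstract would instead take attributes like those returned by `bm25_input_attributes` as features and fit the combination from data.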

Index Terms

1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information retrieval

Published in
CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management
November 2009
2162 pages
ISBN: 9781605585123
DOI: 10.1145/1645953
General Chairs:
David Cheung, University of Hong Kong, Hong Kong
Il-Yeol Song, Drexel University, USA
Program Chairs:
Wesley Chu, UCLA, USA
Xiaohua Hu, Drexel University, USA
Jimmy Lin, University of Maryland, USA
Copyright © 2009 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 2 November 2009
Author Tags
bm25
learning to rank
retrieval models
web search
Qualifiers
- poster
Conference

Acceptance Rates
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%