Offline Comparative Evaluation with Incremental, Minimally-Invasive Online Feedback

ABSTRACT
We investigate the use of logged user interaction data---queries and clicks---for offline evaluation of new search systems in the context of counterfactual analysis. The challenge of evaluating a new ranker against logs collected from a static production ranker is that the new ranker may retrieve documents that never appear in the logs and therefore lack any logged user feedback. Moreover, the production ranker itself biases user actions: even documents that do appear in the logs would have exhibited different interaction patterns had they been retrieved and ranked by the new ranker. We present a methodology for incrementally logging interactions on previously unseen documents and using them to compute an unbiased estimator of a new ranker's effectiveness. Our method is minimally invasive with respect to the production ranker's results, guarding against user dissatisfaction should the new ranker prove poor. We demonstrate that our methods work well in a simulation environment deliberately designed to be challenging for them, and argue that they are therefore likely to work in a wide variety of scenarios.
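The unbiased estimation the abstract refers to is typically built on inverse propensity scoring (IPS): logged clicks are re-weighted by the inverse of the probability that the logging policy exposed the document in the first place. The sketch below is not the paper's algorithm; it is a minimal, generic IPS estimator under assumed simplifications (a flat log of `(query, doc, click, propensity)` tuples and a new ranker represented as a per-query list of retrieved documents), intended only to illustrate the re-weighting idea.

```python
def ips_click_estimate(log, new_rank):
    """Estimate the new ranker's expected clicks per query from logged
    interactions, via inverse propensity scoring (IPS).

    log: list of (query, doc, click, propensity) tuples, where
         `propensity` is the probability that the logging policy
         exposed `doc` for `query` (assumed known and > 0).
    new_rank: dict mapping each query to the list of documents the
              new ranker would return for it (a hypothetical interface).
    """
    queries = {q for q, _, _, _ in log}
    total = 0.0
    for q, doc, click, prop in log:
        # Only interactions on documents the new ranker would also
        # retrieve contribute; dividing by the logging propensity
        # corrects for the logging policy's exposure bias.
        if doc in new_rank.get(q, []):
            total += click / prop
    return total / len(queries)
```

For instance, a document shown to users only half the time (propensity 0.5) has each of its logged clicks counted twice, compensating for the exposures it never received. The practical difficulty, which motivates incremental logging, is that documents the production ranker never showed have propensity zero and cannot be re-weighted at all.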