Offline Comparative Evaluation with Incremental, Minimally-Invasive Online Feedback

ABSTRACT
We investigate the use of logged user interaction data---queries and clicks---for offline evaluation of new search systems in the context of counterfactual analysis. The challenge of evaluating a new ranker against logs collected from a static production ranker is that the new ranker may retrieve documents that never appear in the logs and therefore lack any logged user feedback. Moreover, the production ranker itself biases user actions: even documents that do appear in the logs would have exhibited different interaction patterns had they been retrieved and ranked by the new ranker. We present a methodology for incrementally logging interactions on previously unseen documents and using them to compute an unbiased estimator of a new ranker's effectiveness. Our method is minimally invasive with respect to the production ranker's results, guarding against user dissatisfaction should the new ranker prove poor. We demonstrate that our methods work well in a simulation environment deliberately designed to be challenging for them, and argue that they are therefore likely to work in a wide variety of scenarios.
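The unbiased estimation the abstract refers to is typically built on inverse propensity scoring (IPS): logged clicks are re-weighted by the inverse of the probability that the logging policy exposed the document in the first place. The sketch below is not the paper's algorithm; it is a minimal, generic IPS estimator under assumed simplifications (a flat log of `(query, doc, click, propensity)` tuples and a new ranker represented as a per-query list of retrieved documents), intended only to illustrate the re-weighting idea.

```python
def ips_click_estimate(log, new_rank):
    """Estimate the new ranker's expected clicks per query from logged
    interactions, via inverse propensity scoring (IPS).

    log: list of (query, doc, click, propensity) tuples, where
         `propensity` is the probability that the logging policy
         exposed `doc` for `query` (assumed known and > 0).
    new_rank: dict mapping each query to the list of documents the
              new ranker would return for it (a hypothetical interface).
    """
    queries = {q for q, _, _, _ in log}
    total = 0.0
    for q, doc, click, prop in log:
        # Only interactions on documents the new ranker would also
        # retrieve contribute; dividing by the logging propensity
        # corrects for the logging policy's exposure bias.
        if doc in new_rank.get(q, []):
            total += click / prop
    return total / len(queries)
```

For instance, a document shown to users only half the time (propensity 0.5) has each of its logged clicks counted twice, compensating for the exposures it never received. The practical difficulty, which motivates incremental logging, is that documents the production ranker never showed have propensity zero and cannot be re-weighted at all.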