DOI: 10.1145/3209978.3210050

Offline Comparative Evaluation with Incremental, Minimally-Invasive Online Feedback

Published: 27 June 2018

ABSTRACT

We investigate the use of logged user interaction data (queries and clicks) for offline evaluation of new search systems in the context of counterfactual analysis. The challenge of evaluating a new ranker against log data collected from a static production ranker is that the new ranker may retrieve documents that never appeared in the logs and therefore have no logged user feedback. Additionally, the production ranker itself biases user actions, so even documents that do appear in the logs would have exhibited different interaction patterns had they been retrieved and ranked by the new ranker. We present a methodology for incrementally logging interactions on previously-unseen documents and using them to compute an unbiased estimator of a new ranker's effectiveness. Our method is only minimally invasive with respect to the production ranker's results, insuring against user dissatisfaction in case the new ranker is poor. We demonstrate the effectiveness of our methods in a simulation environment designed to be challenging for them, and argue that they are therefore likely to work well in a wide variety of scenarios.
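
As a rough illustration of the kind of pipeline the abstract describes, the Python sketch below pairs a minimally-invasive logging step (occasionally placing one previously-unseen candidate document into a single low-impact slot of the production ranking) with an inverse-propensity-scored counterfactual estimate of the candidate ranker's utility from the logged clicks. The rank-based propensity model, the single-slot swap policy, and all function names here are illustrative assumptions, not the paper's actual algorithm.

```python
# Illustrative sketch only: the propensity model, the single-slot swap policy,
# and the utility definition are assumptions, not the paper's exact method.
import random

def propensity(rank, eta=1.0):
    """Assumed examination propensity, roughly 1 / rank**eta."""
    return 1.0 / (rank ** eta)

def log_with_minimal_intervention(production_ranking, candidate_ranking,
                                  seen_docs, swap_rank=9, swap_prob=0.1):
    """Return the list actually shown to the user.

    With small probability, place one candidate document that has never
    appeared in the logs into a single low-ranked slot of the production
    ranking, leaving the rest of the result page untouched.
    """
    shown = list(production_ranking)
    unseen = [d for d in candidate_ranking if d not in seen_docs]
    if unseen and swap_rank < len(shown) and random.random() < swap_prob:
        shown[swap_rank] = unseen[0]  # minimal perturbation: one slot changes
    return shown

def ips_utility(click_log, candidate_ranking, eta=1.0):
    """Counterfactual (inverse-propensity-scored) utility of the candidate ranker.

    click_log: iterable of (doc_id, logged_rank, clicked) tuples.
    Each observed click is reweighted by 1 / propensity(logged_rank) and
    credited with a rank-discounted gain at the rank the candidate ranker
    would have assigned the document.
    """
    candidate_rank = {doc: r + 1 for r, doc in enumerate(candidate_ranking)}
    total = 0.0
    for doc, logged_rank, clicked in click_log:
        if clicked and doc in candidate_rank:
            total += (1.0 / propensity(logged_rank, eta)) / candidate_rank[doc]
    return total

if __name__ == "__main__":
    production = ["d%d" % i for i in range(1, 11)]
    candidate = ["d3", "d1", "d42", "d2", "d5", "d4", "d7", "d6", "d9", "d8"]
    seen = set(production)  # documents that already have logged feedback
    shown = log_with_minimal_intervention(production, candidate, seen)
    clicks = [("d3", 1, True), ("d1", 2, False), ("d42", 10, True)]
    print(ips_utility(clicks, candidate))
```

In a real deployment the propensities would come from whatever click model actually governs examination in the logs, and the intervention policy would be tuned so that the expected degradation of the production ranking stays negligible.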


Published in

SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval
June 2018, 1509 pages
ISBN: 9781450356572
DOI: 10.1145/3209978
Copyright © 2018 ACM
Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

SIGIR '18 Paper Acceptance Rate: 86 of 409 submissions, 21%
Overall Acceptance Rate: 792 of 3,983 submissions, 20%
