ABSTRACT
Modern federated search engines aggregate heterogeneous types of results from multiple vertical search engines and compose a single search engine result page (SERP). The aggregator merges the results into one ranked list, constraining the vertical results to specific slots on the SERP.
The usual way to compare two ranking algorithms is to first fix their operating points (internal thresholds) and then run an online experiment lasting multiple weeks. Online user-engagement metrics are then compared to decide which algorithm is better. However, this method does not characterize or compare behavior over the entire span of operating points. Furthermore, this time-consuming approach is impractical if the experiment must be repeated over numerous operating points.
In this paper we propose a method of characterizing model performance that allows us to predict answers to "what if" questions about online user engagement, using click-logs, over the entire span of feasible operating points. We audition verticals at various slots on the SERP and generate click-logs. These logs are then used to create operating curves between variables of interest (for example, between result quality and click-through). The operating point for the system can then be chosen to achieve a specific trade-off between the variables. We apply this methodology to predict i) the online performance of two different models, ii) the impact of changing internal quality thresholds on click-through, iii) the behavior of introducing a new feature, iv) which machine-learning loss function will give better online engagement, and v) the impact of the sampling distribution of head and tail queries in the training process. The results are reported on a well-known federated search engine, and we validate the predictions with online experiments.
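The core idea above can be sketched in a few lines: sweep a candidate threshold over audition click-logs and, at each operating point, measure how often the vertical is shown and how often it is clicked when shown. The sketch below is illustrative only; the record layout, field names, and toy data are assumptions, not the paper's actual implementation.

```python
# Hedged sketch: derive an operating curve from (score, clicked) click-log
# records collected while auditioning a vertical at a fixed SERP slot.
# The log format and the toy numbers below are illustrative assumptions.

def operating_curve(log, thresholds):
    """For each candidate threshold (operating point), compute the fraction
    of logged impressions where the vertical would be shown (coverage) and
    the click-through rate among those shown impressions (ctr)."""
    curve = []
    for t in thresholds:
        shown = [r for r in log if r["score"] >= t]
        coverage = len(shown) / len(log)
        ctr = sum(r["clicked"] for r in shown) / len(shown) if shown else 0.0
        curve.append((t, coverage, ctr))
    return curve

# Toy audition log: each record is one impression of the vertical,
# with the model's quality score and whether the user clicked it.
log = [
    {"score": 0.9, "clicked": 1},
    {"score": 0.8, "clicked": 1},
    {"score": 0.6, "clicked": 0},
    {"score": 0.4, "clicked": 1},
    {"score": 0.2, "clicked": 0},
]

for t, cov, ctr in operating_curve(log, [0.1, 0.5, 0.7]):
    print(f"threshold={t:.1f} coverage={cov:.2f} ctr={ctr:.2f}")
```

Plotting coverage against CTR over many thresholds yields the kind of operating curve the paper uses: "what if" questions (e.g., raising the quality threshold) become reads off the curve rather than multi-week online experiments.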