ABSTRACT
We consider the problem of information retrieval evaluation and the methods and metrics used for such evaluations. We propose a probabilistic framework for evaluation which we use to develop new information-theoretic evaluation metrics. We demonstrate that these new metrics are powerful and generalizable, enabling evaluations heretofore not possible.
We introduce four preliminary uses of our framework: (1) a measure of conditional rank correlation, information tau, a powerful meta-evaluation tool whose use we demonstrate in understanding novelty and diversity evaluation; (2) a new evaluation measure, relevance information correlation, which correlates with traditional evaluation measures; (3) the use of relevance information correlation to evaluate a collection of systems simultaneously, which provides a natural upper bound on metasearch performance; and (4) a measure of the similarity between rankers on judged documents, information difference, which allows us to determine whether systems with similar performance are in fact different.
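To give a concrete flavor of the information-theoretic approach, the sketch below computes the empirical mutual information between two hypothetical systems' assignments of the same judged documents to rank buckets. This is a minimal illustration of the general idea only, not the paper's actual definitions of information tau, relevance information correlation, or information difference; the bucket labels and example rankings are invented for demonstration.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px = Counter(xs)           # marginal counts for xs
    py = Counter(ys)           # marginal counts for ys
    pxy = Counter(zip(xs, ys)) # joint counts
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) )
        mi += p_joint * math.log2(p_joint * n * n / (px[x] * py[y]))
    return mi

# Two hypothetical systems bucket the same six judged documents into
# coarse rank bins. Identical bucketings yield maximal mutual information;
# statistically independent bucketings yield mutual information near zero.
sys_a = ["top", "top", "mid", "mid", "low", "low"]
sys_b = ["top", "top", "mid", "low", "low", "mid"]
score = mutual_information(sys_a, sys_b)
```

A high score indicates the two systems rank the judged documents similarly in information-theoretic terms, even when their document scores are on incomparable scales.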