ABSTRACT
Recent efforts in test collection building have focused on scaling back the number of relevance judgments needed per topic while scaling up the number of search topics. Since the topics are the largest source of variation in a Cranfield-style experiment, this is a reasonable approach. However, as topic set sizes grow and researchers turn to crowdsourcing platforms such as Amazon's Mechanical Turk to collect relevance judgments, quality control becomes a pressing concern. This paper examines the robustness of the TREC Million Query track evaluation methods when some assessors make significant, systematic errors. We find that while average effectiveness scores are robust, assessor errors can have a large effect on system rankings.
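The abstract's central claim, that per-system averages survive assessor error while relative system rankings may not, can be illustrated with a small simulation. The sketch below is a toy reconstruction under stated assumptions (synthetic judgments, a "pessimistic assessor" error model that drops relevant documents on some topics, MAP as the metric, and Kendall's tau for rank agreement); it is not the Million Query track's actual evaluation methodology, and all names and parameters are hypothetical.

```python
# Toy simulation: perturb relevance judgments with a systematic assessor
# error, re-score the systems, and compare average scores and rankings.
# All data, the error model, and the system simulator are illustrative
# assumptions, not the Million Query track's actual procedure.
import random

def average_precision(ranked_docs, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / max(len(relevant), 1)

def kendall_tau(order_a, order_b):
    """Kendall's tau between two total orderings of the same items."""
    pos_b = {item: i for i, item in enumerate(order_b)}
    concordant = discordant = 0
    for i in range(len(order_a)):
        for j in range(i + 1, len(order_a)):
            if pos_b[order_a[i]] < pos_b[order_a[j]]:
                concordant += 1
            else:
                discordant += 1
    pairs = len(order_a) * (len(order_a) - 1) / 2
    return (concordant - discordant) / pairs

random.seed(0)
n_topics, n_docs, n_systems = 50, 200, 8
docs = [f"d{i}" for i in range(n_docs)]

# "True" judgments: each topic has 20 relevant documents.
truth = {t: set(random.sample(docs, 20)) for t in range(n_topics)}

# Systematic error: topics judged by an error-prone assessor lose relevant
# documents with probability miss_rate (a consistently pessimistic judge).
error_topics = set(random.sample(range(n_topics), 15))
miss_rate = 0.6
observed = {t: ({d for d in rel if random.random() > miss_rate}
                if t in error_topics else set(rel))
            for t, rel in truth.items()}

def run_system(quality, topic):
    """Simulate a run: higher quality pushes truly relevant docs upward."""
    scored = [(random.random() + (quality if d in truth[topic] else 0.0), d)
              for d in docs]
    return [d for _, d in sorted(scored, reverse=True)]

systems = {f"sys{k}": 0.1 * k for k in range(n_systems)}
map_true, map_obs = {}, {}
for name, quality in systems.items():
    # The same ranked lists are scored against both judgment sets.
    runs = {t: run_system(quality, t) for t in range(n_topics)}
    map_true[name] = sum(average_precision(runs[t], truth[t]) for t in runs) / n_topics
    map_obs[name] = sum(average_precision(runs[t], observed[t]) for t in runs) / n_topics

rank_true = sorted(systems, key=map_true.get, reverse=True)
rank_obs = sorted(systems, key=map_obs.get, reverse=True)
print({s: round(map_obs[s] - map_true[s], 3) for s in systems})  # per-system MAP shift
print(round(kendall_tau(rank_true, rank_obs), 3))                # rank agreement
```

In this toy setting, raising `miss_rate` or the number of affected topics is one way to probe when closely matched systems begin to swap places even though every system's score shifts by a similar amount, which is the qualitative question the paper studies on real TREC data.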