ABSTRACT
Test collections are powerful mechanisms for the evaluation and optimization of information retrieval systems. However, prior work has reported that experimental outcomes can be affected by changes to the judging guidelines or changes in the judge population. This paper examines such effects in a web search setting, comparing the judgments of four groups of judges: NIST Web Track judges, untrained crowd workers, and two groups of trained judges working for a commercial search engine. Our goal is to identify systematic judging errors by comparing the labels contributed by the different groups, working under the same or different judging guidelines. In particular, we focus on detecting systematic differences in judging that depend on specific characteristics of the queries and URLs. For example, we ask whether a given population of judges, working under a given set of judging guidelines, is more likely to consistently overrate Wikipedia pages than another group judging under the same instructions. Our approach is to identify judging errors with respect to a consensus set, a judged gold set, and a set of user clicks. We further demonstrate how such biases can affect the training of retrieval systems.
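To make the kind of analysis described above concrete, the sketch below illustrates one simple way to compare a group's labels against a consensus set and to segment the comparison by a URL characteristic (here, whether the page is on Wikipedia). This is not the paper's method; the field names, the 0-4 grade scale, the majority-vote consensus with a low-grade tie-break, and the toy data are all assumptions made for illustration.

```python
"""Illustrative sketch: per-group signed error against a majority-vote
consensus, segmented by URL type. All data and field names are hypothetical."""
from collections import Counter, defaultdict

# Each record: (query_id, url, judge_group, grade), grades on an assumed 0-4 scale.
labels = [
    ("q1", "https://en.wikipedia.org/wiki/IR", "crowd",   4),
    ("q1", "https://en.wikipedia.org/wiki/IR", "trained", 2),
    ("q1", "https://en.wikipedia.org/wiki/IR", "nist",    3),
    ("q1", "https://example.com/ir-tutorial",  "crowd",   1),
    ("q1", "https://example.com/ir-tutorial",  "trained", 2),
    ("q1", "https://example.com/ir-tutorial",  "nist",    2),
]

def consensus(labels):
    """Majority-vote grade per (query, url); ties broken toward the lower grade."""
    votes = defaultdict(list)
    for q, url, _group, grade in labels:
        votes[(q, url)].append(grade)
    out = {}
    for key, grades in votes.items():
        counts = Counter(grades)
        top = max(counts.values())
        out[key] = min(g for g, c in counts.items() if c == top)
    return out

def mean_signed_error(labels, consensus_grades, segment):
    """Mean (judge grade - consensus grade) per group over items matching
    `segment`; a positive value suggests the group overrates that segment."""
    sums, counts = defaultdict(float), defaultdict(int)
    for q, url, group, grade in labels:
        if not segment(url):
            continue
        sums[group] += grade - consensus_grades[(q, url)]
        counts[group] += 1
    return {g: sums[g] / counts[g] for g in sums}

gold = consensus(labels)
is_wikipedia = lambda url: "wikipedia.org" in url
print("Wikipedia pages:", mean_signed_error(labels, gold, is_wikipedia))
print("Other pages:    ", mean_signed_error(labels, gold, lambda u: not is_wikipedia(u)))
```

The same signed-error comparison could in principle be run against a judged gold set or click-derived preferences instead of the majority-vote consensus; only the reference grades would change.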