ABSTRACT
Preference-based methods for collecting relevance data for information retrieval (IR) evaluation have been shown to lead to better inter-assessor agreement than the traditional method of judging individual documents. However, little is known about why preference judging reduces assessor disagreement, and whether better agreement among assessors also means better agreement with user satisfaction, as signaled by user clicks. In this paper, we examine the relationship between assessor disagreement and various click-based measures, such as click preference strength and user intent similarity, for judgments collected from editorial judges and crowd workers using single-absolute, pairwise-absolute, and pairwise-preference judging methods. We find that trained judges are significantly more likely to agree with each other and with users than crowd workers are, but that inter-assessor agreement does not necessarily imply agreement with users. Switching to a pairwise judging mode brings crowdsourcing quality close to that of trained judges. We also find a relationship between intent similarity and assessor-user agreement, whose nature changes across judging modes. Overall, our findings suggest that awareness of different possible intents, enabled by pairwise judging, is a key reason for the improved agreement, and a crucial requirement when crowdsourcing relevance data.
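To make the click-based measures mentioned above concrete, the following is a minimal sketch, not taken from the paper, of how a click preference strength between two documents and an intent-similarity proxy between two queries might be estimated from click logs. All function names, signatures, and the exact formulas are illustrative assumptions; the paper's own definitions may differ.

```python
# Hypothetical sketch: click-based preference strength and a crude
# intent-similarity proxy computed from click logs. Names and formulas
# are assumptions, not the paper's definitions.

from collections import Counter
from math import sqrt


def click_preference_strength(clicks_a: int, clicks_b: int) -> float:
    """Preference for document A over B for one query, in [-1, 1].

    0 means no preference; positive values favour A.
    """
    total = clicks_a + clicks_b
    if total == 0:
        return 0.0
    return (clicks_a - clicks_b) / total


def intent_similarity(clicked_urls_q1, clicked_urls_q2) -> float:
    """Cosine similarity between the click distributions of two queries,
    used here as a rough proxy for how similar the underlying intents are.
    """
    c1, c2 = Counter(clicked_urls_q1), Counter(clicked_urls_q2)
    dot = sum(c1[u] * c2[u] for u in c1.keys() & c2.keys())
    norm = sqrt(sum(v * v for v in c1.values())) * \
        sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0


# Example: users clicked doc A 30 times and doc B 10 times for a query,
# giving a moderate preference for A; the two queries below share one
# clicked URL, giving a partial intent overlap.
print(click_preference_strength(30, 10))                  # 0.5
print(intent_similarity(["u1", "u2", "u2"], ["u2", "u3"]))  # ~0.45
```

A preference strength near zero would indicate that clicks do not clearly favour either document, which is where one would expect assessor disagreement, and disagreement with users, to be most pronounced.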