DOI: 10.1145/2505515.2505716
research-article

User intent and assessor disagreement in web search evaluation

Published: 27 October 2013

ABSTRACT

Preference-based methods for collecting relevance data for information retrieval (IR) evaluation have been shown to lead to better inter-assessor agreement than the traditional method of judging individual documents. However, little is known as to why preference judging reduces assessor disagreement and whether better agreement among assessors also means better agreement with user satisfaction, as signaled by user clicks. In this paper, we examine the relationship between assessor disagreement and various click-based measures, such as click preference strength and user intent similarity, for judgments collected from editorial judges and crowd workers using single absolute, pairwise absolute, and pairwise preference-based judging methods. We find that trained judges are significantly more likely to agree with each other and with users than crowd workers, but inter-assessor agreement does not mean agreement with users. Switching to a pairwise judging mode improves crowdsourcing quality to a level close to that of trained judges. We also find a relationship between intent similarity and assessor-user agreement, where the nature of the relationship changes across judging modes. Overall, our findings suggest that the awareness of different possible intents, enabled by pairwise judging, is a key reason for the improved agreement, and a crucial requirement when crowdsourcing relevance data.
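
The following sketch is not from the paper; it only illustrates, under assumed definitions, two quantities the abstract refers to: inter-assessor agreement (computed here as Cohen's kappa over pairwise preference labels) and click preference strength (taken here as the signed margin of user clicks favouring one document of a pair over the other). The function names, label scheme, and strength formula are illustrative assumptions, not the authors' measures.

# Illustrative sketch only (not the paper's code). Assumed definitions:
# inter-assessor agreement is measured with Cohen's kappa over pairwise
# preference labels, and "click preference strength" is taken as the signed
# margin of user clicks favouring one document of a pair over the other.

from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two assessors' labels (e.g. 'left', 'right', 'tie')."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a | freq_b)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)


def click_preference_strength(clicks_a, clicks_b):
    """Assumed measure: click margin for document A over document B, in [-1, 1]."""
    total = clicks_a + clicks_b
    return 0.0 if total == 0 else (clicks_a - clicks_b) / total


if __name__ == "__main__":
    editor = ["left", "left", "right", "tie", "left"]   # trained judge's pairwise labels
    worker = ["left", "right", "right", "tie", "left"]  # crowd worker's pairwise labels
    print("inter-assessor kappa:", round(cohens_kappa(editor, worker), 3))
    print("click preference strength:", round(click_preference_strength(37, 12), 3))

Under these assumptions, a kappa near 1 indicates strong inter-assessor agreement, and a click preference strength near +1 or -1 indicates that users strongly favour one document of the pair; the paper studies how such assessor-level and user-level signals relate across judging modes.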

Published in

CIKM '13: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management
October 2013, 2612 pages
ISBN: 9781450322638
DOI: 10.1145/2505515
Copyright © 2013 ACM

Publisher
Association for Computing Machinery, New York, NY, United States

Acceptance Rates
CIKM '13 paper acceptance rate: 143 of 848 submissions (17%). Overall acceptance rate: 1,861 of 8,427 submissions (22%).
