ABSTRACT
Preference-based methods for collecting relevance data for information retrieval (IR) evaluation have been shown to lead to better inter-assessor agreement than the traditional method of judging individual documents. However, little is known about why preference judging reduces assessor disagreement, and whether better agreement among assessors also means better agreement with user satisfaction, as signaled by user clicks. In this paper, we examine the relationship between assessor disagreement and various click-based measures, such as click preference strength and user intent similarity, for judgments collected from editorial judges and crowd workers using single-absolute, pairwise-absolute, and pairwise-preference judging methods. We find that trained judges are significantly more likely to agree with each other and with users than crowd workers are, but that inter-assessor agreement does not necessarily imply agreement with users. Switching to a pairwise judging mode brings crowdsourcing quality close to that of trained judges. We also find a relationship between intent similarity and assessor-user agreement, whose nature changes across judging modes. Overall, our findings suggest that awareness of different possible intents, enabled by pairwise judging, is a key reason for the improved agreement, and a crucial requirement when crowdsourcing relevance data.
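To make the click-based measures mentioned above concrete, the following is a minimal sketch, not taken from the paper, of how a click preference strength between two documents and an intent-similarity proxy between two queries might be estimated from click logs. All function names, signatures, and the exact formulas are illustrative assumptions; the paper's own definitions may differ.

```python
# Hypothetical sketch: click-based preference strength and a crude
# intent-similarity proxy computed from click logs. Names and formulas
# are assumptions, not the paper's definitions.

from collections import Counter
from math import sqrt


def click_preference_strength(clicks_a: int, clicks_b: int) -> float:
    """Preference for document A over B for one query, in [-1, 1].

    0 means no preference; positive values favour A.
    """
    total = clicks_a + clicks_b
    if total == 0:
        return 0.0
    return (clicks_a - clicks_b) / total


def intent_similarity(clicked_urls_q1, clicked_urls_q2) -> float:
    """Cosine similarity between the click distributions of two queries,
    used here as a rough proxy for how similar the underlying intents are.
    """
    c1, c2 = Counter(clicked_urls_q1), Counter(clicked_urls_q2)
    dot = sum(c1[u] * c2[u] for u in c1.keys() & c2.keys())
    norm = sqrt(sum(v * v for v in c1.values())) * \
        sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0


# Example: users clicked doc A 30 times and doc B 10 times for a query,
# giving a moderate preference for A; the two queries below share one
# clicked URL, giving a partial intent overlap.
print(click_preference_strength(30, 10))                  # 0.5
print(intent_similarity(["u1", "u2", "u2"], ["u2", "u3"]))  # ~0.45
```

A preference strength near zero would indicate that clicks do not clearly favour either document, which is where one would expect assessor disagreement, and disagreement with users, to be most pronounced.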