ABSTRACT
Test collections are powerful mechanisms for the evaluation and optimization of information retrieval systems. However, prior work has reported that experimental outcomes can be affected by changes to the judging guidelines or changes in the judge population. This paper examines such effects in a web search setting, comparing the judgments of four groups of judges: NIST Web Track judges, untrained crowd workers, and two groups of trained judges working for a commercial search engine. Our goal is to identify systematic judging errors by comparing the labels contributed by the different groups, working under the same or different judging guidelines. In particular, we focus on detecting systematic differences in judging that depend on specific characteristics of the queries and URLs. For example, we ask whether a given population of judges, working under a given set of judging guidelines, is more likely to consistently overrate Wikipedia pages than another group judging under the same instructions. Our approach is to identify judging errors with respect to a consensus set, a judged gold set, and a set of user clicks. We further demonstrate how such biases can affect the training of retrieval systems.
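To make the kind of analysis described above concrete, the sketch below illustrates one simple way to compare a group's labels against a consensus set and to segment the comparison by a URL characteristic (here, whether the page is on Wikipedia). This is not the paper's method; the field names, the 0-4 grade scale, the majority-vote consensus with a low-grade tie-break, and the toy data are all assumptions made for illustration.

```python
"""Illustrative sketch: per-group signed error against a majority-vote
consensus, segmented by URL type. All data and field names are hypothetical."""
from collections import Counter, defaultdict

# Each record: (query_id, url, judge_group, grade), grades on an assumed 0-4 scale.
labels = [
    ("q1", "https://en.wikipedia.org/wiki/IR", "crowd",   4),
    ("q1", "https://en.wikipedia.org/wiki/IR", "trained", 2),
    ("q1", "https://en.wikipedia.org/wiki/IR", "nist",    3),
    ("q1", "https://example.com/ir-tutorial",  "crowd",   1),
    ("q1", "https://example.com/ir-tutorial",  "trained", 2),
    ("q1", "https://example.com/ir-tutorial",  "nist",    2),
]

def consensus(labels):
    """Majority-vote grade per (query, url); ties broken toward the lower grade."""
    votes = defaultdict(list)
    for q, url, _group, grade in labels:
        votes[(q, url)].append(grade)
    out = {}
    for key, grades in votes.items():
        counts = Counter(grades)
        top = max(counts.values())
        out[key] = min(g for g, c in counts.items() if c == top)
    return out

def mean_signed_error(labels, consensus_grades, segment):
    """Mean (judge grade - consensus grade) per group over items matching
    `segment`; a positive value suggests the group overrates that segment."""
    sums, counts = defaultdict(float), defaultdict(int)
    for q, url, group, grade in labels:
        if not segment(url):
            continue
        sums[group] += grade - consensus_grades[(q, url)]
        counts[group] += 1
    return {g: sums[g] / counts[g] for g in sums}

gold = consensus(labels)
is_wikipedia = lambda url: "wikipedia.org" in url
print("Wikipedia pages:", mean_signed_error(labels, gold, is_wikipedia))
print("Other pages:    ", mean_signed_error(labels, gold, lambda u: not is_wikipedia(u)))
```

The same signed-error comparison could in principle be run against a judged gold set or click-derived preferences instead of the majority-vote consensus; only the reference grades would change.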