ABSTRACT
Recent efforts in test collection building have focused on scaling back the number of relevance judgments needed per topic while scaling up the number of search topics. Since the topics are the largest source of variation in a Cranfield-style experiment, this is a reasonable approach. However, as topic set sizes grow and researchers turn to crowdsourcing platforms such as Amazon's Mechanical Turk to collect relevance judgments, quality control becomes a pressing concern. This paper examines the robustness of the TREC Million Query track evaluation methods when some assessors make significant, systematic errors. We find that while average effectiveness scores are robust, assessor errors can have a large effect on system rankings.
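The abstract's central claim, that per-system averages survive assessor error while relative system rankings may not, can be illustrated with a small simulation. The sketch below is a toy reconstruction under stated assumptions (synthetic judgments, a "pessimistic assessor" error model that drops relevant documents on some topics, MAP as the metric, and Kendall's tau for rank agreement); it is not the Million Query track's actual evaluation methodology, and all names and parameters are hypothetical.

```python
# Toy simulation: perturb relevance judgments with a systematic assessor
# error, re-score the systems, and compare average scores and rankings.
# All data, the error model, and the system simulator are illustrative
# assumptions, not the Million Query track's actual procedure.
import random

def average_precision(ranked_docs, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / max(len(relevant), 1)

def kendall_tau(order_a, order_b):
    """Kendall's tau between two total orderings of the same items."""
    pos_b = {item: i for i, item in enumerate(order_b)}
    concordant = discordant = 0
    for i in range(len(order_a)):
        for j in range(i + 1, len(order_a)):
            if pos_b[order_a[i]] < pos_b[order_a[j]]:
                concordant += 1
            else:
                discordant += 1
    pairs = len(order_a) * (len(order_a) - 1) / 2
    return (concordant - discordant) / pairs

random.seed(0)
n_topics, n_docs, n_systems = 50, 200, 8
docs = [f"d{i}" for i in range(n_docs)]

# "True" judgments: each topic has 20 relevant documents.
truth = {t: set(random.sample(docs, 20)) for t in range(n_topics)}

# Systematic error: topics judged by an error-prone assessor lose relevant
# documents with probability miss_rate (a consistently pessimistic judge).
error_topics = set(random.sample(range(n_topics), 15))
miss_rate = 0.6
observed = {t: ({d for d in rel if random.random() > miss_rate}
                if t in error_topics else set(rel))
            for t, rel in truth.items()}

def run_system(quality, topic):
    """Simulate a run: higher quality pushes truly relevant docs upward."""
    scored = [(random.random() + (quality if d in truth[topic] else 0.0), d)
              for d in docs]
    return [d for _, d in sorted(scored, reverse=True)]

systems = {f"sys{k}": 0.1 * k for k in range(n_systems)}
map_true, map_obs = {}, {}
for name, quality in systems.items():
    # The same ranked lists are scored against both judgment sets.
    runs = {t: run_system(quality, t) for t in range(n_topics)}
    map_true[name] = sum(average_precision(runs[t], truth[t]) for t in runs) / n_topics
    map_obs[name] = sum(average_precision(runs[t], observed[t]) for t in runs) / n_topics

rank_true = sorted(systems, key=map_true.get, reverse=True)
rank_obs = sorted(systems, key=map_obs.get, reverse=True)
print({s: round(map_obs[s] - map_true[s], 3) for s in systems})  # per-system MAP shift
print(round(kendall_tau(rank_true, rank_obs), 3))                # rank agreement
```

In this toy setting, raising `miss_rate` or the number of affected topics is one way to probe when closely matched systems begin to swap places even though every system's score shifts by a similar amount, which is the qualitative question the paper studies on real TREC data.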