DOI: 10.1145/1835449.1835540

The effect of assessor error on IR system evaluation

Published: 19 July 2010

ABSTRACT

Recent efforts in test collection building have focused on scaling back the number of necessary relevance judgments and then scaling up the number of search topics. Since the largest source of variation in a Cranfield-style experiment comes from the topics, this is a reasonable approach. However, as topic set sizes grow, and researchers look to crowdsourcing and Amazon's Mechanical Turk to collect relevance judgments, we are faced with issues of quality control. This paper examines the robustness of the TREC Million Query track methods when some assessors make significant and systematic errors. We find that while averages are robust, assessor errors can have a large effect on system rankings.
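The kind of experiment the abstract describes can be pictured with a small simulation. The sketch below is purely illustrative and hedged: it does not reproduce the Million Query track estimators (e.g., statAP or MTC), the paper's error models, or any of its data; every document id, run, and error rate is made up. It flips a fraction of synthetic relevance judgments to mimic a systematically erring assessor, ranks synthetic systems by average precision under the original and the perturbed judgments, and reports Kendall's tau between the two system rankings.

# Hypothetical illustration only -- not the paper's methodology.
import random


def average_precision(run, qrels):
    """AP of one ranked document list against {doc_id: 0/1} judgments."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(run, start=1):
        if qrels.get(doc, 0):
            hits += 1
            total += hits / rank
    num_rel = sum(qrels.values())
    return total / num_rel if num_rel else 0.0


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau between two orderings of the same system ids."""
    pos = {sys: i for i, sys in enumerate(ranking_b)}
    concordant = discordant = 0
    for i in range(len(ranking_a)):
        for j in range(i + 1, len(ranking_a)):
            if pos[ranking_a[i]] < pos[ranking_a[j]]:
                concordant += 1
            else:
                discordant += 1
    pairs = len(ranking_a) * (len(ranking_a) - 1) / 2
    return (concordant - discordant) / pairs


random.seed(0)
docs = [f"d{i}" for i in range(100)]
true_qrels = {d: int(random.random() < 0.3) for d in docs}       # made-up "correct" judgments
runs = {f"sys{k}": random.sample(docs, 50) for k in range(10)}   # made-up retrieval runs, one topic


def rank_systems(qrels):
    scores = {name: average_precision(run, qrels) for name, run in runs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Systematic assessor error: 30% of the judgments are flipped.
noisy_qrels = {d: 1 - r if random.random() < 0.3 else r for d, r in true_qrels.items()}

print("Kendall's tau between true and error-affected system rankings:",
      round(kendall_tau(rank_systems(true_qrels), rank_systems(noisy_qrels)), 3))

Comparing the two rankings rather than the two sets of scores mirrors the abstract's distinction: average scores can look stable even when the induced ordering of systems shifts noticeably.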

        • Published in

          SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
          July 2010
          944 pages
          ISBN: 9781450301534
          DOI: 10.1145/1835449

          Copyright © 2010 ACM


          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 19 July 2010

          Qualifiers

          • research-article
