
An analysis of systematic judging errors in information retrieval

Published: 29 October 2012

ABSTRACT

Test collections are powerful mechanisms for the evaluation and optimization of information retrieval systems. However, there is reported evidence that experiment outcomes can be affected by changes to the judging guidelines or changes in the judge population. This paper examines such effects in a web search setting, comparing the judgments of four groups of judges: NIST Web Track judges, untrained crowd workers, and two groups of trained judges working for a commercial search engine. Our goal is to identify systematic judging errors by comparing the labels contributed by the different groups, working under the same or different judging guidelines. In particular, we focus on detecting systematic differences in judging that depend on specific characteristics of the queries and URLs. For example, we ask whether a given population of judges, working under a given set of judging guidelines, is more likely to consistently overrate Wikipedia pages than another group judging under the same instructions. Our approach is to identify judging errors with respect to a consensus set, a judged gold set, and a set of user clicks. We further demonstrate how such biases can affect the training of retrieval systems.
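The comparison described above can be illustrated with a small sketch. The following Python example is not from the paper; it shows one plausible way to measure whether a judge group systematically overrates a class of URLs (here, Wikipedia pages) relative to a consensus label set. The function name, the (query, URL) keyed dictionaries, and the graded-label convention are all illustrative assumptions.

```python
# Illustrative sketch (assumed data layout, not the authors' code): estimate how
# often a judge group's label exceeds the consensus label, split by URL class.

from collections import defaultdict


def overrating_rate(group_labels, consensus_labels, is_wikipedia):
    """group_labels, consensus_labels: dicts mapping (query, url) -> graded label.
    is_wikipedia: predicate on a URL string.
    Returns the fraction of judgments strictly above consensus, per URL class."""
    counts = defaultdict(lambda: [0, 0])  # class -> [overrated, total]
    for (query, url), label in group_labels.items():
        if (query, url) not in consensus_labels:
            continue  # only compare items that have a consensus label
        key = "wikipedia" if is_wikipedia(url) else "other"
        counts[key][1] += 1
        if label > consensus_labels[(query, url)]:
            counts[key][0] += 1
    return {k: over / total for k, (over, total) in counts.items() if total}


# Toy usage: one Wikipedia result rated above consensus, one other result not.
group = {("apple", "https://en.wikipedia.org/wiki/Apple"): 3,
         ("apple", "https://www.apple.com/"): 2}
consensus = {("apple", "https://en.wikipedia.org/wiki/Apple"): 2,
             ("apple", "https://www.apple.com/"): 2}
print(overrating_rate(group, consensus, lambda u: "wikipedia.org" in u))
# e.g. {'wikipedia': 1.0, 'other': 0.0}
```

A gap between the two rates for one judge group but not another would be the kind of systematic, group-specific bias the paper sets out to detect; the same comparison can be repeated against a gold set or click-derived preferences.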


Published in

CIKM '12: Proceedings of the 21st ACM International Conference on Information and Knowledge Management
October 2012
2840 pages
ISBN: 9781450311564
DOI: 10.1145/2396761

Copyright © 2012 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States
