
A qualitative exploration of secondary assessor relevance judging behavior

Published: 26 August 2014

ABSTRACT

Secondary assessors frequently differ in their relevance judgments. Primary assessors are those who originate a search topic and whose judgments truly reflect their own relevance criteria. Secondary assessors do not originate the search and must instead make relevance judgments based on a description of what is and is not relevant; they may, for example, be hired to help construct test collections. Currently, our knowledge about secondary assessors is largely limited to quantitative measurements of the differences between judgments produced by secondary and primary assessors. To better understand the behavior of secondary assessors, we conducted a think-aloud study in which secondary assessors verbalized their thoughts as they judged documents. The think-aloud method gives us insight into how relevance decisions are made. We found that assessors are not always certain in their judgments; in the extreme, secondary assessors are forced to guess at the relevance of documents. We present many reasons and examples of why secondary assessors produce differing relevance judgments. These differences result from interactions among the search topic, the secondary assessor, and the document being judged, and can even apparently be caused by a primary assessor's error in judging relevance. To improve the quality of secondary assessor judgments, we recommend that relevance assessing systems collect assessors' certainty and provide a means to help assessors efficiently express their judgment rationale.
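The abstract's closing recommendation concerns tooling: assessing systems should capture an assessor's certainty alongside each judgment and make it easy to record a rationale. As a purely illustrative sketch under our own assumptions (the paper does not specify an implementation, and all names and fields below are hypothetical), such a judgment record might look roughly like this in Python:

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class RelevanceJudgment:
    # One secondary assessor's decision on a single document. Besides the
    # relevance label, the record keeps the assessor's self-reported certainty
    # and a short free-text rationale, so uncertain or guessed judgments can
    # be flagged and revisited later.
    topic_id: str
    doc_id: str
    assessor_id: str
    relevant: bool            # the relevance label itself
    certainty: int            # self-reported, e.g. 1 (pure guess) to 5 (certain)
    rationale: str = ""       # brief free-text reason for the judgment
    judged_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Example: a low-certainty judgment that a topic originator may want to review.
judgment = RelevanceJudgment(
    topic_id="topic-01", doc_id="doc-0001", assessor_id="assessor-a",
    relevant=True, certainty=2,
    rationale="Topic description is ambiguous about whether summaries count as relevant.",
)
if judgment.certainty <= 2:
    print(f"Flag for review: {judgment.doc_id} (certainty={judgment.certainty})")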


Published in: IIiX '14: Proceedings of the 5th Information Interaction in Context Symposium, August 2014, 368 pages
ISBN: 9781450329767
DOI: 10.1145/2637002
Copyright © 2014 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates
IIiX '14 Paper Acceptance Rate: 21 of 45 submissions, 47%
Overall Acceptance Rate: 21 of 45 submissions, 47%
