ABSTRACT
Secondary assessors frequently differ in their relevance judgments. Primary assessors are those who originate a search topic and whose judgments truly reflect their own relevance criteria. Secondary assessors do not originate the search topic and must instead make relevance judgments based on a description of what is and is not relevant; they may be hired to help construct test collections. Currently, our knowledge about secondary assessors is largely limited to quantitative measurements of the differences between the judgments of secondary and primary assessors. To better understand the behavior of secondary assessors, we conducted a think-aloud study of secondary assessing behavior, asking secondary assessors to verbalize their thoughts as they judged documents. The think-aloud method gives us insight into how relevance decisions are made. We found that assessors are not always certain of their judgments; in the extreme, secondary assessors are forced to guess at the relevance of documents. We present many reasons and examples of why secondary assessors produce differing relevance judgments. These differences result from interactions among the search topic, the secondary assessor, and the document being judged, and can even apparently be caused by a primary assessor's error in judging relevance. To improve the quality of secondary assessors' judgments, we recommend that relevance assessing systems collect assessors' certainty and provide a means to help assessors efficiently express the rationale for their judgments.
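As a concrete illustration of the recommendation above, the following is a minimal sketch of a judgment record that an assessing system could store, capturing a certainty rating and a free-text rationale alongside the relevance label. This is not from the paper; the field names, the 1-to-5 certainty scale, and the example values are illustrative assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class JudgmentRecord:
    """One secondary assessor's judgment of one document for one topic."""
    topic_id: str
    doc_id: str
    assessor_id: str
    relevant: bool   # the binary relevance judgment
    certainty: int   # e.g., 1 (a guess) .. 5 (certain); the scale is an assumption
    rationale: str = ""  # short free-text explanation of the judgment
    judged_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

# Hypothetical example: an uncertain non-relevant judgment with a brief rationale.
record = JudgmentRecord(
    topic_id="topic-412",
    doc_id="doc-0057",
    assessor_id="assessor-07",
    relevant=False,
    certainty=2,
    rationale="Mentions the topic's subject but not the event the topic asks about.",
)
print(record)
```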