ABSTRACT
Secondary assessors frequently differ in their relevance judgments. Primary assessors are those who originate a search topic and whose judgments truly reflect their own relevance criteria. Secondary assessors do not originate the search topic and must instead make relevance judgments based on a description of what is and is not relevant; they may be hired to help construct test collections. Currently, our knowledge about secondary assessors is largely limited to quantitative measurements of the differences between the judgments of secondary and primary assessors. To better understand the behavior of secondary assessors, we conducted a think-aloud study of secondary assessing behavior, asking secondary assessors to verbalize their thoughts as they judged documents. The think-aloud method gives us insight into how relevance decisions are made. We found that assessors are not always certain of their judgments; in the extreme, secondary assessors are forced to guess at the relevance of documents. We present many reasons and examples of why secondary assessors produce differing relevance judgments. These differences result from interactions among the search topic, the secondary assessor, and the document being judged, and can even apparently be caused by a primary assessor's error in judging relevance. To improve the quality of secondary assessors' judgments, we recommend that relevance assessing systems collect assessors' certainty and provide a means to help assessors efficiently express the rationale for their judgments.
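As a concrete illustration of the recommendation above, the following is a minimal sketch of a judgment record that an assessing system could store, capturing a certainty rating and a free-text rationale alongside the relevance label. This is not from the paper; the field names, the 1-to-5 certainty scale, and the example values are illustrative assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class JudgmentRecord:
    """One secondary assessor's judgment of one document for one topic."""
    topic_id: str
    doc_id: str
    assessor_id: str
    relevant: bool   # the binary relevance judgment
    certainty: int   # e.g., 1 (a guess) .. 5 (certain); the scale is an assumption
    rationale: str = ""  # short free-text explanation of the judgment
    judged_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

# Hypothetical example: an uncertain non-relevant judgment with a brief rationale.
record = JudgmentRecord(
    topic_id="topic-412",
    doc_id="doc-0057",
    assessor_id="assessor-07",
    relevant=False,
    certainty=2,
    rationale="Mentions the topic's subject but not the event the topic asks about.",
)
print(record)
```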