ABSTRACT
In Information Retrieval evaluation, the classical approach of adopting binary relevance judgments has been replaced by multi-level relevance judgments and by gain-based metrics that leverage such multi-level judgment scales. Recent work has also proposed and evaluated unbounded relevance scales by means of Magnitude Estimation (ME) and compared them with multi-level scales. While ME brings advantages, such as allowing assessors to always judge the next document as having higher or lower relevance than any of the documents they have judged so far, it also comes with some drawbacks. For example, it is not a natural approach for human assessors, who are used to judging items on bounded scales on the Web (e.g., 5-star ratings). In this work, we propose and experimentally evaluate a bounded and fine-grained relevance scale that retains many of the advantages of ME while addressing some of its issues. We collect relevance judgments over a 100-level relevance scale (S100) by means of a large-scale crowdsourcing experiment and compare the results with other relevance scales (binary, 4-level, and ME), showing the benefit of fine-grained scales over both coarse-grained and unbounded scales and highlighting some new findings on ME. Our results show that S100 maintains the flexibility of unbounded scales like ME in providing assessors with ample choice when judging document relevance (i.e., assessors can fit new relevance judgments in between previously given ones). It also allows assessors to judge on a more familiar scale (e.g., on 10 levels) and to work efficiently from the very first judging task.
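Comparing judgments collected on a fine-grained scale such as S100 with coarser scales typically involves collapsing the 0-100 scores into fewer levels. The sketch below is only an illustration of that idea, not the paper's actual procedure: the threshold at 50 and the equal-width 4-level bins are hypothetical example cut-offs.

```python
# Illustrative sketch (assumed cut-offs, not the paper's method): collapse
# fine-grained S100 judgments onto coarser scales for cross-scale comparison.

def s100_to_binary(score: int) -> int:
    """Map a 0-100 judgment to binary relevance (0 = non-relevant, 1 = relevant)."""
    return 1 if score >= 50 else 0  # hypothetical threshold

def s100_to_four_level(score: int) -> int:
    """Map a 0-100 judgment to a 4-level scale (0..3) using equal-width bins."""
    return min(score // 25, 3)  # hypothetical equal-width binning

if __name__ == "__main__":
    s100_judgments = [5, 37, 62, 88, 100]  # hypothetical crowd judgments
    print([s100_to_binary(s) for s in s100_judgments])      # [0, 0, 1, 1, 1]
    print([s100_to_four_level(s) for s in s100_judgments])  # [0, 1, 2, 3, 3]
```

Once judgments from different scales are mapped onto a common range, standard agreement measures (e.g., Krippendorff's alpha) or rank correlations over system scores can be used to compare the scales.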