ABSTRACT
In Information Retrieval evaluation, the classical approach of adopting binary relevance judgments has been replaced by multi-level relevance judgments and by gain-based metrics that leverage such multi-level judgment scales. Recent work has also proposed and evaluated unbounded relevance scales by means of Magnitude Estimation (ME) and compared them with multi-level scales. While ME brings advantages, such as allowing assessors to always judge the next document as having higher or lower relevance than any of the documents they have judged so far, it also comes with some drawbacks. For example, it is not a natural approach for human assessors, who are used to judging items on bounded scales on the Web (e.g., 5-star ratings). In this work, we propose and experimentally evaluate a bounded and fine-grained relevance scale that retains many of the advantages of ME while addressing some of its issues. We collect relevance judgments over a 100-level relevance scale (S100) by means of a large-scale crowdsourcing experiment and compare the results with other relevance scales (binary, 4-level, and ME), showing the benefit of fine-grained scales over both coarse-grained and unbounded scales and highlighting some new findings on ME. Our results show that S100 maintains the flexibility of unbounded scales like ME in providing assessors with ample choice when judging document relevance (i.e., assessors can fit new relevance judgments in between previously given ones). It also allows assessors to judge on a more familiar scale (e.g., on 10 levels) and to work efficiently from the very first judging task.
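Comparing judgments collected on a fine-grained scale such as S100 with coarser scales typically involves collapsing the 0-100 scores into fewer levels. The sketch below is only an illustration of that idea, not the paper's actual procedure: the threshold at 50 and the equal-width 4-level bins are hypothetical example cut-offs.

```python
# Illustrative sketch (assumed cut-offs, not the paper's method): collapse
# fine-grained S100 judgments onto coarser scales for cross-scale comparison.

def s100_to_binary(score: int) -> int:
    """Map a 0-100 judgment to binary relevance (0 = non-relevant, 1 = relevant)."""
    return 1 if score >= 50 else 0  # hypothetical threshold

def s100_to_four_level(score: int) -> int:
    """Map a 0-100 judgment to a 4-level scale (0..3) using equal-width bins."""
    return min(score // 25, 3)  # hypothetical equal-width binning

if __name__ == "__main__":
    s100_judgments = [5, 37, 62, 88, 100]  # hypothetical crowd judgments
    print([s100_to_binary(s) for s in s100_judgments])      # [0, 0, 1, 1, 1]
    print([s100_to_four_level(s) for s in s100_judgments])  # [0, 1, 2, 3, 3]
```

Once judgments from different scales are mapped onto a common range, standard agreement measures (e.g., Krippendorff's alpha) or rank correlations over system scores can be used to compare the scales.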