DOI: 10.1145/3209978.3210052
Research article

On Fine-Grained Relevance Scales

Published: 27 June 2018

ABSTRACT

In Information Retrieval evaluation, the classical approach of adopting binary relevance judgments has been replaced by multi-level relevance judgments and by gain-based metrics that leverage such multi-level judgment scales. Recent work has also proposed and evaluated unbounded relevance scales by means of Magnitude Estimation (ME) and compared them with multi-level scales. While ME brings advantages, such as the ability for assessors to always judge the next document as more or less relevant than any document they have judged so far, it also comes with drawbacks: for example, it is not the kind of judging approach assessors are accustomed to on the Web (e.g., 5-star ratings). In this work, we propose and experimentally evaluate a bounded, fine-grained relevance scale that retains many of the advantages of ME while addressing some of its issues. We collect relevance judgments on a 100-level relevance scale (S100) by means of a large-scale crowdsourcing experiment and compare the results with other relevance scales (binary, 4-level, and ME), showing the benefit of fine-grained scales over both coarse-grained and unbounded scales as well as highlighting some new results on ME. Our results show that S100 maintains the flexibility of unbounded scales like ME in giving assessors ample choice when judging document relevance (i.e., assessors can fit new judgments in between previously given ones). It also allows assessors to judge on a more familiar scale (e.g., on 10 levels) and to work efficiently from the very first judging task.
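
The abstract refers to gain-based metrics that leverage multi-level judgment scales and to comparing the fine-grained S100 scale with coarser ones. As a minimal, hedged sketch (not taken from the paper), the Python snippet below shows how 0-100 judgments could feed a standard nDCG-style gain computation directly, and how they could be coarsened to a 4-level scale for comparison; the thresholds and function names are illustrative assumptions, not the paper's actual mapping.

    import math

    def ndcg(gains, k=None):
        """Normalized discounted cumulative gain for a ranked list of gain values."""
        k = k or len(gains)
        dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
        ideal = sorted(gains, reverse=True)[:k]
        idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
        return dcg / idcg if idcg > 0 else 0.0

    def s100_to_4level(score):
        """Hypothetical mapping from a 0-100 judgment down to a 4-level scale (0-3)."""
        # The thresholds below are illustrative assumptions, not the paper's mapping.
        return min(score // 25, 3)

    # S100 relevance judgments for a ranked list of documents (made-up data).
    s100_judgments = [87, 12, 60, 0, 33]

    print(ndcg(s100_judgments))                               # gain-based metric on the fine-grained scale
    print(ndcg([s100_to_4level(s) for s in s100_judgments]))  # same metric after coarsening to 4 levels
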


Published in

      SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval
      June 2018
      1509 pages
ISBN: 9781450356572
DOI: 10.1145/3209978

      Copyright © 2018 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 27 June 2018


      Qualifiers

      • research-article

      Acceptance Rates

SIGIR '18 paper acceptance rate: 86 of 409 submissions (21%). Overall acceptance rate: 792 of 3,983 submissions (20%).
